Make a permanent connection to MySQL (Apache Spark) - mysql

If I execute these commands in spark-shell it correctly returns the data in the "people" table:
val dataframe_mysql = spark.sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/db_spark")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "people")
  .option("user", "root")
  .option("password", "****")
  .load()
dataframe_mysql.show
The problem is that if I close spark-shell and then reopen it, the connection to the database is not maintained.

As per Spark's documentation, the SparkContext and HiveContext are created inside the spark shell (when the spark-shell command is executed), with the HiveContext exposed as sqlContext. Because the connection is tied to that SQLContext, closing the spark shell means you lose the SQLContext and hence the connection; it has to be re-established in each new session.
Here's another reference:
When you run spark-shell, which is your interactive driver
application, it automatically creates a SparkContext defined as sc and
a HiveContext defined as sqlContext.
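In practice that means the read has to be re-issued in every new session: the JDBC options are just configuration for a lazy read, not a persistent connection object. A minimal PySpark sketch of the same idea (connection details copied from the question; the helper name is just illustrative), which you can keep in a script and run at the start of each session:
from pyspark.sql import SparkSession

def load_people(spark):
    # Nothing survives between sessions, so re-create the DataFrame on demand.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://localhost/db_spark")
            .option("driver", "com.mysql.jdbc.Driver")
            .option("dbtable", "people")
            .option("user", "root")
            .option("password", "****")
            .load())

spark = SparkSession.builder.getOrCreate()
load_people(spark).show()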

Related

Can't connect to MySQL database from pyspark, getting JDBC error

I am learning pyspark and trying to connect to a MySQL database.
But I am getting a java.lang.ClassNotFoundException: com.mysql.jdbc.Driver exception while running the code. I have spent a whole day trying to fix it; any help would be appreciated :)
I am using PyCharm Community Edition with Anaconda and Python 3.6.3.
Here is my code:
from pyspark import SparkContext
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://192.168.0.11:3306/my_db_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="billing",
    user="root",
    password="root").load()
Here is the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.load.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
This was asked 9 months ago at the time of writing, but since there's no answer, here it goes. I was in the same situation, searched Stack Overflow over and over, and tried different suggestions, but the answer is finally absurdly simple: you just have to COPY the MySQL driver JAR into the "jars" folder of Spark!
Download it here: https://dev.mysql.com/downloads/connector/j/5.1.html
I'm using the 5.1 version; 8.0 exists, but I had some problems running the latest version with Spark 2.3.2 (and also other problems running Spark 2.4 on Windows 10).
Once downloaded, you can just copy it into your Spark folder:
E:\spark232_hadoop27\jars\ (use your own drive:\folder_name -- this is just an example)
You should have two files:
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47-bin.jar
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47.jar
After that, the following code, launched through PyCharm or a Jupyter notebook, should work (as long as you have a MySQL database set up, that is):
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe_mysql = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/uoc2",
    driver="com.mysql.jdbc.Driver",
    dbtable="company",
    user="root",
    password="password").load()
dataframe_mysql.show()
Bear in mind that I'm currently working locally with my Spark setup, so there are no real clusters involved and no "production" kind of code that gets submitted to such a cluster. For something more elaborate this answer could help: MySQL read with PySpark
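If you'd rather not copy the connector into the Spark installation itself, a variation that should also work (a sketch; the jar path below is just an example, point it at wherever you saved the connector) is to hand the jar to the session via the spark.jars setting when building it:
from pyspark.sql import SparkSession

# Tell Spark where the MySQL connector jar lives instead of dropping it into <spark>/jars.
# Forward slashes avoid Python backslash-escape issues on Windows.
spark = (SparkSession.builder
         .config("spark.jars", "E:/spark232_hadoop27/jars/mysql-connector-java-5.1.47.jar")
         .getOrCreate())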
On my computer, @Kondado's solution works only if I change the driver in the options:
driver = 'com.mysql.cj.jdbc.Driver'
I am using Spark 2.4.0 on Windows with MySQL Connector/J 8.0. I downloaded mysql-connector-java-8.0.15.jar (the Platform Independent version) from here and copied it to 'C:\spark-2.4.0-bin-hadoop2.7\jars\'.
My code in PyCharm looks like this:
# import findspark  # not necessary
# findspark.init()  # not necessary
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(spark.sparkContext)  # build the SQLContext from the underlying SparkContext
source_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # not com.mysql.jdbc.Driver
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
I don't know how to add the jar file to the classpath (can someone tell me how?), so I put it in the SparkSession config and it works fine.
spark = SparkSession \
    .builder \
    .appName('test') \
    .master('local[*]') \
    .enableHiveSupport() \
    .config("spark.driver.extraClassPath", "<path to mysql-connector-java-5.1.49-bin.jar>") \
    .getOrCreate()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost/<database_name>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", <table_name>) \
    .option("user", <user>) \
    .option("password", <password>) \
    .load()
df.show()
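To answer the parenthetical question above: an alternative to setting the property in code (a sketch of the same setting, with the path left as a placeholder like above) is to add it once to conf/spark-defaults.conf so every session picks it up:
spark.driver.extraClassPath  <path to mysql-connector-java-5.1.49-bin.jar>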
This worked for me: pyspark with MSSQL.
Java version is 1.7.0_191.
PySpark version is 2.1.2.
Download the jar files below:
sqljdbc41.jar
mssql-jdbc-6.2.2.jre7.jar
Place the above jars inside the jars folder of the virtual environment:
test_env/lib/python3.6/site-packages/pyspark/jars
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
url = 'jdbc:sqlserver://your_host_name:your_port;databaseName=YOUR_DATABASE_NAME;useNTLMV2=true;'
df = spark.read.format('jdbc') \
    .option('url', url) \
    .option('user', 'your_db_username') \
    .option('password', 'your_db_password') \
    .option('dbtable', 'YOUR_TABLE_NAME') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()
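Once the load no longer throws a driver ClassNotFoundException, a quick sanity check on the result:
df.printSchema()  # confirm the SQL Server columns came through
df.show(5)        # preview a few rows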

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, hence I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
it gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help me sort it out? Thank you very much; any clue is appreciated.
The correct way to do this would be to add the options when starting the shell (say, if you are starting a spark shell):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the Databricks CSV jar directly, but I used a Netezza connector for Spark where they mention using this option:
https://github.com/SparkTC/spark-netezza
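One caveat about the NoClassDefFoundError above: spark-csv itself depends on commons-csv (and univocity-parsers), so passing only the spark-csv jar with --jars is not enough when --packages cannot reach the internet. If you have to stay fully offline, download those dependency jars too and list them all, comma-separated (file names below are examples; check the spark-csv 1.4.0 POM for the exact versions):
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar,/home/rx52019/data/commons-csv-1.1.jar,/home/rx52019/data/univocity-parsers-1.5.1.jar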

Run spark-shell command in shell script

#!/bin/sh
spark-shell
import org.apache.spark.sql.SparkSession
val url="jdbc:mysql://localhost:3306/slow_and_tedious"
val prop = new java.util.Properties
prop.setProperty("user",”scalauser”)
prop.setProperty("password","scalauser123")
val people = spark.read.jdbc(url,"sat",prop)
The above commands are used to make a connection between MySQL and Spark using JDBC.
Instead of writing these commands every time, I thought of making a script, but when I run the script above it throws this error.
Create a Scala file named test.scala with your code, like below:
import org.apache.spark.sql.SparkSession
val url="jdbc:mysql://localhost:3306/slow_and_tedious"
val prop = new java.util.Properties
prop.setProperty("user",”scalauser”)
prop.setProperty("password","scalauser123")
val people = spark.read.jdbc(url,"sat",prop)
Log in to spark-shell using the following command:
spark-shell --jars mysql-connector.jar
You can use the following command to execute the code you created above:
scala> :load /path/test.scala
A shell script launches a new SparkContext every time, which takes longer to execute.
If you use the command above, it will just execute the code in test.scala.
Since the SparkContext is already loaded when you log in to spark-shell, time is saved when you execute the script.
Try this:
Write your code in a file, filex.txt.
In your Unix shell script, include the following:
cat filex.txt | spark-shell
Seemingly, you can't push the script into the background (using &).
You can paste your script into a file, then execute:
spark-shell < {your file name}

Spark read.json can't find file

Hey all, I have a standalone Spark cluster on AWS with 1 master and 1 slave node. I have a folder in my home directory called ~/Notebooks. This is where I launch Jupyter notebooks and connect to Jupyter in my browser. I also have a file in there called people.json (a simple JSON file).
I try running this code
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")
I get this error when I run that last line. I don't get it; the file is right there... Any ideas?
Py4JJavaError: An error occurred while calling o238.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist
Make sure the file is available on the worker nodes. The best way is to use a shared file system (NFS, HDFS). Read the External Datasets documentation.
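For example (a sketch; it assumes an HDFS service is reachable from the cluster and that the file was uploaded first with hdfs dfs -put ~/Notebooks/people.json /user/ubuntu/people.json), read the file through an hdfs:// path so the worker node can open it too:
# Read from HDFS instead of the driver's local filesystem (path is an assumption).
df = sqlContext.read.json("hdfs:///user/ubuntu/people.json")
df.show()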

How to use Spark DataFrame with MySQL

OK, I know that I can use the JDBC connector to create a DataFrame with this command:
val jdbcDF = sqlContext.load("jdbc",
Map("url" -> "jdbc:mysql://localhost:3306/video_rcmd?user=root&password=123456",
"dbtable" -> "video"))
But I got this error: java.sql.SQLException: No suitable driver found for ...
And I have tried to add the JDBC jar to the Spark classpath with both of the following commands, but failed:
spark-shell --jars mysql-connector-java-5.0.8-bin.jar
SPARK_CLASSPATH=mysql-connector-java-5.0.8-bin.jar spark-shell
My Spark version is 1.3.0, while Class.forName("com.mysql.jdbc.Driver").newInstance works.
This is caused because the DataFrame cannot find the MySQL Connector jar in the classpath. It can be resolved by adding the jar to the Spark classpath as below:
Edit /spark/bin/compute-classpath.sh as follows:
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR:yourPathToJar/mysql-connector-java-5.0.8-bin.jar"
Save the file and restart Spark.
You might want to try mysql-connector-java-5.1.29-bin.jar