Run spark-shell command in shell script - mysql

#!/bin/sh
spark-shell
import org.apache.spark.sql.SparkSession
val url="jdbc:mysql://localhost:3306/slow_and_tedious"
val prop = new java.util.Properties
prop.setProperty("user",”scalauser”)
prop.setProperty("password","scalauser123")
val people = spark.read.jdbc(url,"sat",prop)
The above commands are used to make a connection between MySQL and Spark using JDBC.
But instead of writing these commands every time, I thought of making a script; however, when I run the above script it throws this error.

Create a Scala file named test.scala with your code, like below:
import org.apache.spark.sql.SparkSession
val url="jdbc:mysql://localhost:3306/slow_and_tedious"
val prop = new java.util.Properties
prop.setProperty("user",”scalauser”)
prop.setProperty("password","scalauser123")
val people = spark.read.jdbc(url,"sat",prop)
Log in to spark-shell using the following command:
spark-shell --jars mysql-connector.jar
You can use the following command to execute the code which you created above:
scala> :load /path/test.scala
A shell script launches a new SparkContext every time it runs, which takes more time to execute.
If you use the above command, it will just execute the code that is in test.scala.
Since the SparkContext is already loaded when you log in to spark-shell, time is saved when you execute the script.
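If you still want a one-command wrapper (bearing in mind it will still start a fresh SparkContext on each run), spark-shell can also preload a file with -i. A minimal sketch, assuming the test.scala and connector jar paths from above:
#!/bin/sh
# Launch spark-shell with the MySQL driver and preload the connection code
spark-shell --jars mysql-connector.jar -i /path/test.scala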

Try this,
Write your code in a file, e.g. filex.txt.
In your Unix shell script, include the following:
cat filex.txt | spark-shell
Seemingly, you can't push the script into the background (using &).

You can paste your script into a file, then execute:
spark-shell < {your file name}


Caused by: com.typesafe.config.ConfigException$Missing

When I run my app in IDEA it works fine, but when I package it and run it with java -jar on Linux, I get this exception.
The problem code is:
val config = ConfigFactory.parseFile(new File("src/main/resources/master.conf"))
master = context.actorSelection(s"akka.tcp://masterSystem@${config.getString("akka.remote.netty.tcp.hostname")}:${config.getString("akka.remote.netty.tcp.port")}/user/Master")
The src/main/resources path doesn't exist in your jar, and that's by design.
Try first getting your config in your code like this:
val config = ConfigFactory.load()
and then pass java -Dconfig.file=master.conf -jar ... when launching your program (config.location is not a property Typesafe Config recognizes; config.file is).
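A minimal sketch of that load-then-override pattern (the config keys are taken from the question; config.file is Typesafe Config's standard override property, and the jar name is a placeholder):
import com.typesafe.config.ConfigFactory

// Loads application.conf/reference.conf from the classpath by default;
// -Dconfig.file=master.conf replaces it at launch time.
val config = ConfigFactory.load()
val host = config.getString("akka.remote.netty.tcp.hostname")
val port = config.getString("akka.remote.netty.tcp.port")
Launch with: java -Dconfig.file=master.conf -jar your-app.jar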

Can't connect to MySQL database from pyspark, getting JDBC error

I am learning pyspark and trying to connect to a MySQL database.
But I am getting a java.lang.ClassNotFoundException: com.mysql.jdbc.Driver exception while running the code. I have spent a whole day trying to fix it; any help would be appreciated :)
I am using PyCharm Community Edition with Anaconda and Python 3.6.3.
Here is my code:
from pyspark import SparkContext,SQLContext
sc= SparkContext()
sqlContext= SQLContext(sc)
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://192.168.0.11:3306/my_db_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="billing",
    user="root",
    password="root").load()
Here is the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.load.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
This was asked 9 months ago at the time of writing, but since there's no answer, here it goes. I was in the same situation, searched Stack Overflow over and over, and tried different suggestions, but the answer is absurdly simple: you just have to COPY the MySQL driver into the "jars" folder of Spark!
Download it here: https://dev.mysql.com/downloads/connector/j/5.1.html
I'm using the 5.1 version even though 8.0 exists, because I had some other problems running the latest version with Spark 2.3.2 (and also other problems running Spark 2.4 on Windows 10).
Once downloaded, you can just copy it into your Spark folder:
E:\spark232_hadoop27\jars\ (use your own drive:\folder_name; this is just an example)
You should have two files:
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47-bin.jar
E:\spark232_hadoop27\jars\mysql-connector-java-5.1.47.jar
After that, the following code, launched through PyCharm or a Jupyter notebook, should work (as long as you have a MySQL database set up, that is):
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe_mysql = spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/uoc2",
    driver="com.mysql.jdbc.Driver",
    dbtable="company",
    user="root",
    password="password").load()
dataframe_mysql.show()
Bear in mind that I'm currently working locally with my Spark setup, so no real clusters are involved, and there is no "production" kind of code that gets submitted to such a cluster. For something more elaborate, this answer could help: MySQL read with PySpark
On my computer, @Kondado's solution works only if I change the driver in the options:
driver = 'com.mysql.cj.jdbc.Driver'
I am using Spark 2.4.0 on Windows. I downloaded mysql-connector-java-8.0.15.jar, the Platform Independent version, from here, and copied it to 'C:\spark-2.4.0-bin-hadoop2.7\jars\'.
My code in PyCharm looks like this:
#import findspark # not necessary
#findspark.init() # not necessary
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(sc)
source_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # com.mysql.jdbc.Driver
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
I don't know how to add the jar file to the classpath (can someone tell me how?), so I put it in the SparkSession config and it works fine.
spark = SparkSession \
.builder \
.appName('test') \
.master('local[*]') \
.enableHiveSupport() \
.config("spark.driver.extraClassPath", "<path to mysql-connector-java-5.1.49-bin.jar>") \
.getOrCreate()
df = spark.read.format("jdbc").option("url","jdbc:mysql://localhost/<database_name>").option("driver","com.mysql.jdbc.Driver").option("dbtable",<table_name>).option("user",<user>).option("password",<password>).load()
df.show()
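Alternatively (an equivalent approach, assuming you launch from a shell rather than setting the classpath in code), the jar can be passed at startup:
pyspark --jars /path/to/mysql-connector-java-5.1.49-bin.jar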
This worked for me: pyspark with MSSQL.
The Java version is 1.7.0_191; the pyspark version is 2.1.2.
Download the below jar files:
sqljdbc41.jar
mssql-jdbc-6.2.2.jre7.jar
Paste the above jars inside the jars folder of the virtual environment:
test_env/lib/python3.6/site-packages/pyspark/jars
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
url = 'jdbc:sqlserver://your_host_name:your_port;databaseName=YOUR_DATABASE_NAME;useNTLMV2=true;'
df = spark.read.format('jdbc') \
    .option('url', url) \
    .option('user', 'your_db_username') \
    .option('password', 'your_db_password') \
    .option('dbtable', 'YOUR_TABLE_NAME') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .load()

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, hence I need to import a package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local machine due to a proxy issue.
Can anyone tell me the right way to refer to a local jar?
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
It takes me to the pyspark shell as expected; however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
(routes.dat has already been uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)
it gives me this error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help me sort this out? Thank you very much; any clue is appreciated.
The correct way to do this would be to add the options (say, if you are starting a spark shell):
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the databricks CSV jar directly, but I used a Netezza connector to Spark where they mention using this option:
https://github.com/SparkTC/spark-netezza
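If the proxy also blocks --packages, a workaround is to download the dependency jars by hand and pass them all via --jars, since spark-csv does not bundle its dependencies; that is exactly why org/apache/commons/csv/CSVFormat goes missing. A sketch, assuming you also download commons-csv and univocity-parsers (the versions shown are illustrative):
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar,/home/rx52019/data/commons-csv-1.1.jar,/home/rx52019/data/univocity-parsers-1.5.1.jar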

Make permanent connection with MySQL (Apache Spark)

If I execute these commands in spark-shell, it correctly returns the data in the "people" table:
val dataframe_mysql = spark.sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/db_spark")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "people")
  .option("user", "root")
  .option("password", "****")
  .load()
dataframe_mysql.show
The problem is that if I close spark-shell and reopen it, the connection to the database is not maintained.
As per Spark's documentation, the SparkContext and HiveContext get created inside the Spark shell (when the spark-shell command is executed), with the HiveContext defined as sqlContext. As the connection is mapped to that SQLContext, closing the Spark shell means you can no longer access the SQLContext, and hence you won't be able to connect.
Here's another reference:
When you run spark-shell, which is your interactive driver
application, it automatically creates a SparkContext defined as sc and
a HiveContext defined as sqlContext.
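In other words, the connection has to be re-created in every new session; it cannot be kept alive across shells. A common workaround (a sketch reusing the code from the question; the file name is a placeholder) is to keep the load code in a file, say people.scala, and replay it whenever a new shell starts:
// people.scala -- run with :load /path/people.scala in each new spark-shell
val dataframe_mysql = spark.sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/db_spark")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "people")
  .option("user", "root")
  .option("password", "****")
  .load()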

Gson import error in Scala

I am using the Gson library for parsing JSON data. I am trying to run a program from the terminal as follows:
scala -classpath "*.jar" JsonParsing.scala
To which I am getting the following error:
JsonParsing.scala:2: error: object google is not a member of package com
import com.google.gson.Gson
I am unsure why this error occurs, when I have the gson jar in the correct folder:
gson-2.2.2.jar
I am using import statements as follows:
import com.google.gson.Gson
import com.google.gson.JsonObject
import com.google.gson.JsonParser
Help on this error would be appreciated. Thanks.
Your dependency does not include the google package.
You can use:
// https://mvnrepository.com/artifact/com.google.code.gson/gson
libraryDependencies += "com.google.code.gson" % "gson" % "2.8.0"
or download the appropriate jar: http://www.java2s.com/Code/Jar/g/gson.htm
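For example, a minimal build.sbt using that dependency (the project name and Scala version are placeholders):
name := "json-parsing"
scalaVersion := "2.11.8"
libraryDependencies += "com.google.code.gson" % "gson" % "2.8.0"
With this in place, sbt run compiles against gson without any manual classpath handling.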
Compile:
$ scalac -classpath <path_to_your_jar_files> -d classes path/to/classes/you/want/to/compile/*.scala
Execute:
$ scala -classpath classes:<path_to_your_jar_files> com.your.package.ClassYouWantToRun
This is not a good way of doing it because it's not scalable. You should be using a tool like SBT to build and run projects.