How to Connect Spark SQL with a MySQL Database from Scala - mysql

Problem Statement:
Hi, I am a newbie to the Spark world. I want to query a MySQL database and load one table into Spark, then apply a filter to that table using a SQL query. Once the result is filtered, I want to return it as JSON. All of this has to be done from a standalone Scala application.
I am struggling to initialize the SparkContext and am getting an error. I know I am missing some piece of information.
Can somebody have a look at the code and tell me what I need to do?
Code:
import application.ApplicationConstants
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, Dataset, Row, Column, SQLContext}
var sc: SparkContext = null
val sparkSession = SparkSession.builder().master("spark://10.62.10.71:7077")
  .config("format", "jdbc")
  .config("url", "jdbc:mysql://localhost:3306/test")
  .config("user", "root")
  .config("password", "")
  .appName("MySQLSparkConnector")
  .getOrCreate()
var conf = new SparkConf()
conf.setAppName("MongoSparkConnectorIntro")
  .setMaster("local")
  .set("format", "jdbc")
  .set("url", "jdbc:mysql://localhost:3306/test")
  .set("user", "root")
  .set("password", "")
sc = new SparkContext(conf)
val connectionProperties = new java.util.Properties
connectionProperties.put("user", username)
connectionProperties.put("password", password)
val customDF2 = sparkSession.read.jdbc(url, "employee", connectionProperties)
println("program ended")
Error:
Following is the error that I am getting:
64564 [main] ERROR org.apache.spark.SparkContext - Error initializing SparkContext.
java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
64566 [main] INFO org.apache.spark.SparkContext - SparkContext already stopped.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
P.S.: If anybody can point me to a link or tutorial that shows a similar scenario with Scala, that would be great.
Versions:
Spark: 2.4.0
Scala: 2.12.8
MySQL Connector Jar: 8.0.13

I think you are mixing things up by creating a SparkContext and putting the MySQL connection settings into the Spark configs.
If you are using Spark 2.0+, use only SparkSession as the entry point:
val spark = SparkSession.builder().master("local[*]").appName("Test").getOrCreate
// Add the connection properties as below
val prop = new java.util.Properties()
prop.put("user", "user")
prop.put("password", "password")
val url = "jdbc:mysql://host:port/dbName"
Now read the table as a DataFrame:
val df = spark.read.jdbc(url, "tableName", prop)
To access the sparkContext and sqlContext, you can get them from the SparkSession:
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
Make sure you have the mysql-connector-java jar on your classpath, and add the dependency to your pom.xml or build.sbt.
Hope this helps!
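Beyond the read, the question also asks to filter the table with a SQL query and return the result as JSON. A minimal end-to-end sketch along those lines (the employee table and the salary filter are hypothetical examples, not from the answer above):

```scala
import org.apache.spark.sql.SparkSession

object MySQLToJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MySQLToJson")
      .getOrCreate()

    val prop = new java.util.Properties()
    prop.put("user", "root")
    prop.put("password", "")

    // Read the table, register it as a temp view, and filter with SQL
    val df = spark.read.jdbc("jdbc:mysql://localhost:3306/test", "employee", prop)
    df.createOrReplaceTempView("employee")
    val filtered = spark.sql("SELECT * FROM employee WHERE salary > 50000")

    // Collect the result as JSON strings, one per row
    val json: Array[String] = filtered.toJSON.collect()
    json.foreach(println)

    spark.stop()
  }
}
```

Each element of the collected array is one row serialized as a JSON object. For large results, prefer filtered.write.json(path) over collecting everything to the driver.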

Related

java.lang.ClassNotFoundException: org.json4s.JsonAST$JValue

I am trying a POC on Kafka where I load a dataset into a topic and read it back. I am trying to create a struct as follows, to apply to the data I will read from the Kafka topic:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
//import org.apache.spark.sql.types.DataType.j
//import org.json4s._
//import org.json4s.na
val df = spark
  .read
  .format("kafka")
  .options(Admin.commonOptions)
  .option("subscribe", topic)
  .load()

df.printSchema()

val personStringDF = df.selectExpr("CAST(value AS STRING)")
println("personStringDF--")
personStringDF.show()
personStringDF.printSchema()

val schemaTopic: StructType = StructType(
  Array(
    StructField(name = "col1", dataType = StringType, nullable = false),
    StructField(name = "col2", dataType = StringType, nullable = false)
  ))
My BUILD file:
java_library(
    name = "spark",
    exports = [
        "@maven//:org_apache_spark_spark_core_2_12",
        "@maven//:org_apache_spark_spark_sql_2_12",
        "@maven//:org_apache_spark_spark_unsafe_2_12",
        "@maven//:org_apache_spark_spark_tags_2_12",
        "@maven//:org_apache_spark_spark_catalyst_2_12",
        "@maven//:com_fasterxml_jackson_core_jackson_annotations",
        "@maven//:com_fasterxml_jackson_core_jackson_core",
        "@maven//:com_fasterxml_jackson_core_jackson_databind",
        "@maven//:com_typesafe_play_play_json_2_12_2_9_1",
        "@maven//:org_json4s_json4s_ast_2_12_4_0_0",
        "@maven//:org_json4s_json4s_jackson_2_12_4_0_0",
    ],
)
but I am getting Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue
Can anybody help here? I am not sure why I am getting this.
(I am running this code with Bazel. I have a WORKSPACE file as well, with all these dependencies mentioned there; the Bazel build is successful, so this is a runtime error.)
This issue was resolved by downgrading the json4s dependencies to version 3.6.6 in the WORKSPACE file:
"org.json4s:json4s-ast_2.12:3.6.6",
"org.json4s:json4s-core_2.12:3.6.6",
"org.json4s:json4s-jackson_2.12:3.6.6",
"org.json4s:json4s-scalap_2.12:3.6.6",
You missed some dependent json4s libs (json4s-core and json4s-scalap, in addition to json4s-ast and json4s-jackson). Bazel requires you to explicitly enumerate all needed dependencies in the build file.
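For readers using sbt rather than Bazel, the same version pin could be expressed with dependencyOverrides. This is only a sketch; the json4s coordinates are taken from the WORKSPACE snippet above:

```scala
// build.sbt: force every json4s artifact to one consistent version,
// overriding whatever transitive versions other libraries pull in
dependencyOverrides ++= Seq(
  "org.json4s" %% "json4s-ast"     % "3.6.6",
  "org.json4s" %% "json4s-core"    % "3.6.6",
  "org.json4s" %% "json4s-jackson" % "3.6.6",
  "org.json4s" %% "json4s-scalap"  % "3.6.6"
)
```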

spark failing to read mysql data and save in hdfs [duplicate]

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"#"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500);
The whole operation takes 20 to 30 seconds.
Now I am trying the same thing, but instead of a CSV I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
.read()
.option("url", url)
.option("dbtable", this.tableName)
.option("driver","com.mysql.jdbc.Driver")
.format("jdbc")
.load();
set.take(500);
This is taking 5 to 10 minutes.
I am running Spark inside the JVM, using the same configuration in both cases.
I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and a further issue is that the schema of the table is unknown to me.
My issue is not how to decrease the required time, as I know that in the ideal case Spark will run in a cluster; what I cannot understand is why there is such a big time difference between the two cases above.
This problem has been covered multiple times on StackOverflow:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
so just to reiterate - by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.
To distribute reads:
use ranges with lowerBound / upperBound:
Dataset<Row> set = sc
    .read()
    .format("jdbc")
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .load();
use predicates:
Properties properties = new Properties();
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
Please follow the steps below.
1. Download a copy of the JDBC connector for MySQL. I believe you already have one.
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
2. Create a db-properties.flat file in the below format:
jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
user=<username>
password=<password>
3. Create an empty table first, where you want to load the data.
Invoke the Spark shell with the driver class on the classpath:
spark-shell --driver-class-path <your path to mysql jar>
Then import all the required packages:
import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
Initiate a HiveContext or a SQLContext:
val sQLContext = new HiveContext(sc)
import sQLContext.implicits._
import sQLContext.sql
Set some of the properties:
sQLContext.setConf("hive.exec.dynamic.partition", "true")
sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Load the MySQL DB properties from the file:
val dbProperties = new Properties()
dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
val jdbcurl = dbProperties.getProperty("jdbcUrl")
Create a query to read the data from your table and pass it to the read method of the sqlContext. This is where you can manage your WHERE clause:
val df1 = "(SELECT * FROM your_table_name) as s1"
Pass the jdbcurl, the select query, and the DB properties to the read method:
val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
Write it to your table:
df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")

What are SparkSession Config Options

I am trying to use SparkSession to convert the JSON data of a file to an RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to Spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
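Since the answer notes that Scala would be very similar, here is a hedged Scala sketch of the same lookup (shown with a local master only for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// All Spark parameters as key-value pairs, from the underlying SparkConf
spark.sparkContext.getConf.getAll.foreach { case (k, v) => println(s"$k=$v") }

// The runtime-mutable configuration, from the SparkSession itself
spark.conf.getAll.foreach { case (k, v) => println(s"$k=$v") }
```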
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
sc = spark.sparkContext
SparkContext
spark_conf = (
SparkConf()
.setAppName("Your App Name")
.set("spark.some.config.option1", "some-value")
.set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf = spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
  --properties-file path/to/your/spark-defaults.conf \
  --name "Your App Name" \
  --py-files path/to/your/supporting/pyspark_files.zip \
  path/to/your/pyspark_main.py
This is how it worked for me to add Spark or Hive settings in my Scala code:
{
  val spark = SparkSession
    .builder()
    .appName("StructStreaming")
    .master("yarn")
    .config("hive.merge.mapfiles", "false")
    .config("hive.merge.tezfiles", "false")
    .config("parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("hive.merge.smallfiles.avgsize", "160000000")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.sql.orc.impl", "native")
    .config("spark.sql.parquet.binaryAsString", "true")
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    //.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
    .getOrCreate()
}
The easiest way to set a config at runtime:
spark.conf.set("spark.sql.shuffle.partitions", 500)
where spark refers to a SparkSession. This way you can set configs at runtime, which is really useful when you want to change configs again and again to tune some Spark parameters for specific queries.
In simple terms, values set via the "config" method of the builder are automatically propagated to both the SparkConf and the SparkSession's own configuration.
For example, you can refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn more about this API, refer to: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run-time as in the example above, or through the config file given to spark-submit.

Java Execption while parsing JSON RDD from Kafka Stream

I am trying to read a JSON string from Kafka using the Spark Streaming library. The code is able to connect to the Kafka broker but fails while decoding the message. The code is inspired by
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson.scala
val kStream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kParams, kTopic).map(_._2)
println("Starting to read from kafka topic:" + topicStr)
kStream.foreachRDD { rdd =>
  if (rdd.toLocalIterator.nonEmpty) {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.read.json(rdd).registerTempTable("mytable")
    if (firstTime) {
      sqlContext.sql("SELECT * FROM mytable").printSchema()
    }
    val df = sqlContext.sql(selectStr)
    df.collect.foreach(println)
    df.rdd.saveAsTextFile(fileName)
    mergeFiles(fileName, firstTime)
    firstTime = false
    println(rdd.name)
  }
}
java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.<init>(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:222)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
The problem was with the version of the Kafka jars used; using 0.9.0.0 fixed the issue. The class kafka.message.MessageAndMetadata was introduced in 0.8.2.0.
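A hedged sketch of how such a version pin might look in build.sbt; the spark-streaming-kafka version shown is an assumption, matching the Spark 1.x-era KafkaUtils.createDirectStream API used above:

```scala
// build.sbt: keep the Kafka client at a version whose binary signatures match
// what the spark-streaming-kafka connector was compiled against
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
  "org.apache.kafka" %% "kafka" % "0.9.0.0"
)
```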

I am reading JSON data from kafka and parsing the data using spark. But I end up with JSON parser issue

I am reading JSON data from Kafka and parsing the data using Spark, but I end up with a JSON parser issue. The code is shown below:
val Array(zkQuorum, groupId, topics, numThreads) = args
val conf = new SparkConf()
  .setAppName("KafkaAggregation")
// create sparkContext
val sc = new SparkContext(conf)
// streaming context
val ssc = new StreamingContext(conf, Seconds(1))
// ssc.checkpoint("hdfs://localhost:8020/usr/tmp/data")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap).map(_._2)
val lineJson = lines.map(JSON.parseFull(_))
  .map(_.get.asInstanceOf[scala.collection.immutable.Map[String, Any]])
Error details:
error: not found: value JSON
[INFO] val lineJson = lines.map(JSON.parseFull(_))
Which Maven dependency should I use to sort out the error?
I think you are looking for this:
import scala.util.parsing.json._
And add the Maven dependency:
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-parser-combinators</artifactId>
    <version>2.11.0-M4</version>
</dependency>
https://mvnrepository.com/artifact/org.scala-lang/scala-parser-combinators/2.11.0-M4
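As a quick sanity check of JSON.parseFull once the dependency is on the classpath, a minimal sketch outside of Spark (the sample JSON string is hypothetical):

```scala
import scala.util.parsing.json.JSON

val raw = """{"name": "alice", "age": 30}"""

// parseFull returns Option[Any]; a JSON object comes back as a Map[String, Any]
JSON.parseFull(raw) match {
  case Some(m: Map[String, Any] @unchecked) =>
    println(m("name"))
  case _ =>
    println("not valid JSON")
}
```

Note that numbers are parsed as Double, so m("age") would be 30.0, not 30; this is one reason many projects prefer a dedicated JSON library over scala.util.parsing.json.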