I am trying a POC on kafka where I am loading a dataset to a topic and reading from it. I am trying to create a struct as follow to apply to the data that I will read from kafka topic:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
//import org.apache.spark.sql.types.DataType.j
//import org.json4s._
//import org.json4s.na
val df = spark
.read
.format("kafka")
.options(Admin.commonOptions)
.option("subscribe", topic)
.load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
println("personStringDF--")
personStringDF.show()
personStringDF.printSchema()
val schemaTopic: StructType = StructType(
Array(
StructField(name = "col1", dataType = StringType, nullable = false),
StructField(name = "col2", dataType = StringType, nullable = false)
))
My BUILD file :
java_library(
name = "spark",
exports = [
"#maven//:org_apache_spark_spark_core_2_12",
"#maven//:org_apache_spark_spark_sql_2_12",
"#maven//:org_apache_spark_spark_unsafe_2_12",
"#maven//:org_apache_spark_spark_tags_2_12",
"#maven//:org_apache_spark_spark_catalyst_2_12",
"#maven//:com_fasterxml_jackson_core_jackson_annotations",
"#maven//:com_fasterxml_jackson_core_jackson_core",
"#maven//:com_fasterxml_jackson_core_jackson_databind",
"#maven//:com_typesafe_play_play_json_2_12_2_9_1",
"#maven//:org_json4s_json4s_ast_2_12_4_0_0",
"#maven//:org_json4s_json4s_jackson_2_12_4_0_0"
,
],
)
but getting Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/JsonAST$JValue
Can anybody help here not sure why am I getting this?
(running this code with Bazel I have a workspace file as well all these dependencies mentioned there this is runtime error bazel build is successful)
This issue is resolved by downgrading version of JSON dependencies to
3.6.6 in WORKSPACE file
"org.json4s:json4s-ast_2.12:3.6.6",
"org.json4s:json4s-core_2.12:3.6.6",
"org.json4s:json4s-jackson_2.12:3.6.6",
"org.json4s:json4s-scalap_2.12:3.6.6",
You missed some dependent json4s libs.
Bazel require to explicitly enumerate all needed dependencies in build file.
Related
I am trying to make a dynamic schema creation out of JSON records from text file as every record will have different schema. The following is my code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json, col}
object streamingexample {
def main(args: Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[*]")
.appName("SparkByExamples")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
val newdf11=df1
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
val df2 = df1.select(from_json($"value", schema_of_json(json_schema)).alias("value_new"))
val df3 = df2.select($"value_new.*")
df3.printSchema()
df3.writeStream
.option("truncate", "false")
.format("console")
.start()
.awaitTermination()
}
}
I am getting the following error. Please help on how to fix the code. I tried a lot. unable to figure out.
Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Sample data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
This statement in your code causing the problem from your code, as you already know.
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
You can get json schema in a different way like below...
val dd: DataFrame = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
dd.show()
/** you can use val df1 = spark.readStream.textFile(yourfile) also **/
val json_schema = dd.schema.json;
println(json_schema)
Result :
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}
you can further refine to your requirements I will leave it to you
This exception occurred because you are trying to access the data from the stream before the stream was started. Issues is with the df3.printSchema() make sure to call this function after the stream start.
Problem Statement:
Hi, I am a newbie to the Spark World. I want to query the MySQL Database and then load one table into the Spark. Then I want to apply some filter on the table using SQL Query. Once the result is filtered I want to return the result as JSON. All this we have to do from a standalone Scala base application.
I am struggling to initialize the Spark Context and getting an error. I know I am missing some piece of information.
Can Somebody have a look on the code and tell me what I need to do.
Code:
import application.ApplicationConstants
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, Dataset, Row, Column, SQLContext}
var sc: SparkContext = null
val sparkSession = SparkSession.builder().master("spark://10.62.10.71:7077")
.config("format","jdbc")
.config("url","jdbc:mysql://localhost:3306/test")
.config("user","root")
.config("password","")
.appName("MySQLSparkConnector")
.getOrCreate()
var conf = new SparkConf()
conf.setAppName("MongoSparkConnectorIntro")
.setMaster("local")
.set("format", "jdbc")
.set("url","jdbc:mysql://localhost:3306/test")
.set("user","root")
.set("password","")
sc = new SparkContext(conf)
val connectionProperties = new java.util.Properties
connectionProperties.put("user", username)
connectionProperties.put("password", password)
val customDF2 = sparkSession.read.jdbc(url,"employee",connectionProperties)
println("program ended")
Error:
Following is the error that I am getting:
64564 [main] ERROR org.apache.spark.SparkContext - Error initializing SparkContext.
java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
64566 [main] INFO org.apache.spark.SparkContext - SparkContext already stopped.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
P.S: If anybody can give me any link or tutorial that is showing the similar scenario with Scala.
Versions:
Spark: 2.4.0
Scala: 2.12.8
MySQL Connector Jar: 8.0.13
I think you are messing around creating spark context and configs to connect MYSQL
IF you are using spark 2.0+ only use SparkSession as a entry-point as
val spark = SparkSession.builder().master("local[*]").appName("Test").getOrCreate
//Add Properties asbelow
val prop = new java.util.Properties()
prop.put("user", "user")
prop.put("password", "password")
val url = "jdbc:mysql://host:port/dbName"
Now read the table with as dataframe
val df = spark.read.jdbc(url, "tableName", prop)
To access sparkContext and sqlContext you can access from SparkSession as
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
Make sure you have mysql-connector-java jar in classpath, Add the dependency to your pom.xml or built.sbt
Hope this helps!
I just downloaded Spark 2.2 from the website, and created a simple project with the example from here.
The code is this:
import java.util.Properties
import org.apache.spark
object MysqlTest {
def main(args: Array[String]) {
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://localhost/hap")
.option("dbtable", "hap.users")
.option("user", "***")
.option("password", "***")
.load()
}
}
The problem is that apparently spark.read does not exist.
I guess the Spark API's documentation is not up to date and the examples do not work. I would appreciate a working example.
I think you need this :
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Yo bro")
.getOrCreate()
The docs should be correct, but you skipped over the strt where the initialization is explained.https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession
The convention whith spark docs is the spark is a SparkSession instance, so that needs to be created first. You do this with the SparkSessionBuilder.
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
I am trying to use SparkSession to convert JSON data of a file to RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, “The entry point to programming Spark with the Dataset and DataFrame API," run the following (this is using Spark Python API, Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster,” run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest//api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
sc = spark.sparkContext
SparkContext
spark_conf = (
SparkConf()
.setAppName("Your App Name")
.set("spark.some.config.option1", "some-value")
.set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf = spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
--class Main path/to/your/pyspark_main.py
This is how it worked for me to add spark or hive settings in my scala:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config(“spark.sql.streaming.checkpointLocation”, “hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db”)
.getOrCreate()
}
The easiest way to set some config:
spark.conf.set("spark.sql.shuffle.partitions", 500).
Where spark refers to a SparkSession, that way you can set configs at runtime. It's really useful when you want to change configs again and again to tune some spark parameters for specific queries.
In simple terms, values set in "config" method are automatically propagated to both SparkConf and SparkSession's own configuration.
for eg :
you can refer to
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how hive warehouse locations are set for SparkSession using config option
To know about the this api you can refer to : https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is expolained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at run-time as in your example above or through the config file given to spark-submit
I am reading JSON data from kafka and parsing the data using spark. But I end up with JSON parser issue. Code shown below:
val Array(zkQuorum, groupId, topics, numThreads) = args
val conf = new SparkConf()
.setAppName("KafkaAggregation")
// create sparkContext
val sc = new SparkContext(conf)
// streaming context
val ssc = new StreamingContext(conf, Seconds(1))
// ssc.checkpoint("hdfs://localhost:8020/usr/tmp/data")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap).map((_._2))
val lineJson = lines.map(JSON.parseFull(_))
.map(_.get.asInstanceOf[scala.collection.immutable.Map[String,Any]])
Error details:
error: not found: value JSON
[INFO] val lineJson = lines.map(JSON.parseFull(_))
Which maven dependency should use I to sort out the error?
I think you are looking for this:
import scala.util.parsing.json._
ANd adding Maven:
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-parser-combinators</artifactId>
<version>2.11.0-M4</version>
</dependency>
https://mvnrepository.com/artifact/org.scala-lang/scala-parser-combinators/2.11.0-M4