Java Exception while parsing JSON RDD from Kafka Stream - json

I am trying to read a JSON string from Kafka using the Spark Streaming library. The code is able to connect to the Kafka broker but fails while decoding the message. The code is inspired by
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson.scala
val kStream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kParams, kTopic).map(_._2)
println("Starting to read from kafka topic:" + topicStr)
kStream.foreachRDD { rdd =>
  if (rdd.toLocalIterator.nonEmpty) {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.read.json(rdd).registerTempTable("mytable")
    if (firstTime) {
      sqlContext.sql("SELECT * FROM mytable").printSchema()
    }
    val df = sqlContext.sql(selectStr)
    df.collect.foreach(println)
    df.rdd.saveAsTextFile(fileName)
    mergeFiles(fileName, firstTime)
    firstTime = false
    println(rdd.name)
  }
}
java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.<init>(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:222)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

The problem was with the version of the Kafka jars used; using 0.9.0.0 fixed the issue. The class kafka.message.MessageAndMetadata was introduced in 0.8.2.0.
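For reference, a minimal sketch of matching build dependencies (the Spark, Scala, and sbt details below are assumptions for illustration; the point is that the Kafka client jar on the classpath must be 0.8.2.0 or newer, e.g. 0.9.0.0):

// build.sbt -- versions are illustrative, not taken from the question
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.6.3",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
  // the Kafka client must provide the MessageAndMetadata constructor Spark expects
  "org.apache.kafka" %% "kafka" % "0.9.0.0"
)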

Related

Issue reading multiline kafka messages in Spark

I'm trying to read a multiline JSON message on Spark 2.0.0, but I'm getting _corrupt_record. The code works fine for single-line JSON, and the multiline JSON also reads fine when I load it as wholeTextFiles in the REPL.
stream.map(record => (record.key(), record.value())).foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    logger.info("----Start of the PersistIPDataRecords Batch processing------")
    // taking only the value part of each RDD
    val newRDD = rdd.map(x => x._2.toString())
    logger.info("--------------------Before Loop-----------------")
    newRDD.foreach(println)
    import spark.implicits._
    val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(newRDD).printSchema()
    logger.info("----Converting RDD to Dataframe-----")
  } else logger.info("---------No data received in RDD-----------")
})
ssc.start()
ssc.awaitTermination()
When I try reading it as a file in the REPL it works fine:
scala> val df=spark.read.json(spark.sparkContext.wholeTextFiles("/user/maria_dev/jsondata/employees_multiLine.json").values)
JSON file:
{"empno":"7369", "ename":"SMITH", "designation":"CLERK", "manager":"7902", "hire_date":"12/17/1980", "sal":"800", "deptno":"20"}

How to Connect Spark SQL with MySQL Database in Scala

Problem Statement:
Hi, I am a newbie to the Spark world. I want to query the MySQL database and load one table into Spark. Then I want to apply a filter on the table using a SQL query. Once the result is filtered, I want to return it as JSON. All of this has to be done from a standalone Scala-based application.
I am struggling to initialize the SparkContext and I am getting an error. I know I am missing some piece of information.
Can somebody have a look at the code and tell me what I need to do?
Code:
import application.ApplicationConstants
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, Dataset, Row, Column, SQLContext}

var sc: SparkContext = null
val sparkSession = SparkSession.builder().master("spark://10.62.10.71:7077")
  .config("format", "jdbc")
  .config("url", "jdbc:mysql://localhost:3306/test")
  .config("user", "root")
  .config("password", "")
  .appName("MySQLSparkConnector")
  .getOrCreate()
var conf = new SparkConf()
conf.setAppName("MongoSparkConnectorIntro")
  .setMaster("local")
  .set("format", "jdbc")
  .set("url", "jdbc:mysql://localhost:3306/test")
  .set("user", "root")
  .set("password", "")
sc = new SparkContext(conf)
val connectionProperties = new java.util.Properties
connectionProperties.put("user", username)
connectionProperties.put("password", password)
val customDF2 = sparkSession.read.jdbc(url, "employee", connectionProperties)
println("program ended")
Error:
Following is the error that I am getting:
64564 [main] ERROR org.apache.spark.SparkContext - Error initializing SparkContext.
java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
64566 [main] INFO org.apache.spark.SparkContext - SparkContext already stopped.
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at manager.SparkSQLMySQLDBConnector$.main(SparkSQLMySQLDBConnector.scala:21)
at manager.SparkSQLMySQLDBConnector.main(SparkSQLMySQLDBConnector.scala)
P.S.: It would also help if anybody could point me to a link or tutorial that shows a similar scenario with Scala.
Versions:
Spark: 2.4.0
Scala: 2.12.8
MySQL Connector Jar: 8.0.13
I think you are mixing things up when creating the Spark context and the configs to connect to MySQL.
If you are using Spark 2.0+, use only SparkSession as the entry point:
val spark = SparkSession.builder().master("local[*]").appName("Test").getOrCreate
// Add properties as below
val prop = new java.util.Properties()
prop.put("user", "user")
prop.put("password", "password")
val url = "jdbc:mysql://host:port/dbName"
Now read the table as a DataFrame:
val df = spark.read.jdbc(url, "tableName", prop)
To access the sparkContext and sqlContext, you can get them from the SparkSession:
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
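Since the goal in the question is to filter the table with a SQL query and return the result as JSON, here is a minimal sketch building on the df read above (the view name matches the question's employee table; the sal column in the filter is a hypothetical column used only for illustration):

// register the JDBC DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employee")
// filter with SQL ("sal" is an assumed column name for illustration)
val filtered = spark.sql("SELECT * FROM employee WHERE sal > 1000")
// convert each filtered row into a JSON string
val jsonStrings: Array[String] = filtered.toJSON.collect()
jsonStrings.foreach(println)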
Make sure you have the mysql-connector-java jar on the classpath; add the dependency to your pom.xml or build.sbt.
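For example, with sbt (the connector version shown is the one mentioned in the question):

// build.sbt
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.13"

The equivalent Maven coordinates are groupId mysql, artifactId mysql-connector-java.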
Hope this helps!

How to use CustomJsonParser to parse json string in Spark Structured Streaming?

Instead of parsing the whole JSON string, the user will provide a CustomJsonParser to parse a partial JSON string into a CustomObject. How can this CustomJsonParser be used to convert the JSON string in Spark Structured Streaming, instead of using the from_json and get_json_object methods?
Sample code:
val jsonDF = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kakfaBrokers)
  .option("subscribe", kafkaConsumeTopicName)
  .option("group.id", kafkaConsumerGroupId)
  .option("startingOffsets", startingOffsets)
  .option("auto.offset.reset", autoOffsetReset)
  .option("key.deserializer", classOf[StringDeserializer].getName)
  .option("value.deserializer", classOf[StringDeserializer].getName)
  .option("enable.auto.commit", "false")
  .load()
val messagesDF = jsonDF.selectExpr("CAST(value AS STRING)")
spark.udf.register("parseJson", (json: String) =>
  customJsonParser.parseJson(json)
)
val objDF = messagesDF.selectExpr("""parseJson(value) AS message""")
val query = objDF.writeStream
  .outputMode(OutputMode.Append())
  .format("console")
  .start()
query.awaitTermination()
It fails with the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type com.xxx.xxxEntity is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:693)
  at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:159)
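For context, Spark can only register a UDF when it can derive a Catalyst schema for the return type (primitives, case classes of supported types, collections of those, and so on); an arbitrary class such as com.xxx.xxxEntity is not supported. A minimal sketch of a parser UDF that does register cleanly, assuming a hypothetical case class that mirrors the fields needed downstream:

// Hypothetical target type: a case class whose fields Spark knows how to encode
case class ParsedMessage(id: String, value: Double)

spark.udf.register("parseJson", (json: String) => {
  // customJsonParser is the user's parser; mapping its result into the case
  // class (and the getId/getValue accessors) is an assumption for illustration
  val entity = customJsonParser.parseJson(json)
  ParsedMessage(entity.getId, entity.getValue)
})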

Reading JSON with Apache Spark - `corrupt_record`

I have a JSON file, nodes, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in Scala through the spark-shell.
From this tutorial, I can see that it is possible to read JSON via sqlContext.read.json:
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications and I am confident it is not corrupt but sound JSON.
As Spark expects the "JSON Lines" format, not a typical JSON format, we can tell Spark to read typical JSON by specifying:
val df = spark.read.option("multiline", "true").json("<file>")
Spark cannot read a top-level JSON array into records, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple: Spark expects a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL. Basically it's an alternative to CSV.
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended; from the wholeTextFiles docs:
Small files are preferred, large file is also allowable, but may cause bad performance.
I ran into the same problem. I used the sparkContext and sparkSql with the same configuration:
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()
Then, using the Spark context, I read the whole JSON file (JSON is the path to the file):
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters...
val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using Spark SQL:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"

I am reading JSON data from Kafka and parsing the data using Spark, but I end up with a JSON parser issue

I am reading JSON data from Kafka and parsing the data using Spark, but I end up with a JSON parser issue. Code shown below:
val Array(zkQuorum, groupId, topics, numThreads) = args
val conf = new SparkConf()
  .setAppName("KafkaAggregation")
// create sparkContext
val sc = new SparkContext(conf)
// streaming context
val ssc = new StreamingContext(conf, Seconds(1))
// ssc.checkpoint("hdfs://localhost:8020/usr/tmp/data")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap).map((_._2))
val lineJson = lines.map(JSON.parseFull(_))
  .map(_.get.asInstanceOf[scala.collection.immutable.Map[String, Any]])
Error details:
error: not found: value JSON
[INFO] val lineJson = lines.map(JSON.parseFull(_))
Which Maven dependency should I use to sort out the error?
I think you are looking for this:
import scala.util.parsing.json._
And adding the Maven dependency:
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-parser-combinators</artifactId>
    <version>2.11.0-M4</version>
</dependency>
https://mvnrepository.com/artifact/org.scala-lang/scala-parser-combinators/2.11.0-M4
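For reference, once the dependency and import are in place, JSON.parseFull returns an Option[Any] (a Map[String, Any] for a JSON object); a minimal sketch with an illustrative input string:

import scala.util.parsing.json.JSON

// parseFull returns Some(value) on success and None if the string is not valid JSON
val parsed: Option[Any] = JSON.parseFull("""{"empno":"7369","ename":"SMITH"}""")
parsed match {
  case Some(m: Map[String, Any] @unchecked) => println(m("ename")) // prints SMITH
  case _                                    => println("could not parse JSON")
}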