How to use CustomJsonParser to parse a JSON string in Spark Structured Streaming?

Instead of parsing the whole JSON string, the user will provide a CustomJsonParser that parses a partial JSON string into a CustomObject. How can this CustomJsonParser be used to convert the JSON string in Spark Structured Streaming, instead of the from_json and get_json_object methods?
Sample code:
val jsonDF = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", kafkaConsumeTopicName)
  .option("group.id", kafkaConsumerGroupId)
  .option("startingOffsets", startingOffsets)
  .option("auto.offset.reset", autoOffsetReset)
  .option("key.deserializer", classOf[StringDeserializer].getName)
  .option("value.deserializer", classOf[StringDeserializer].getName)
  .option("enable.auto.commit", "false")
  .load()

val messagesDF = jsonDF.selectExpr("CAST(value AS STRING)")

spark.udf.register("parseJson", (json: String) =>
  customJsonParser.parseJson(json)
)

val objDF = messagesDF.selectExpr("parseJson(value) AS message")

val query = objDF.writeStream
  .outputMode(OutputMode.Append())
  .format("console")
  .start()

query.awaitTermination()
It fails with the following error:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type com.xxx.xxxEntity is not supported at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:755)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:693)
at
org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:159)
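The stack trace shows the failure in ScalaReflection.schemaFor, called from UDFRegistration.register: when a UDF is registered, Spark has to derive a Catalyst schema for the UDF's return type, and an arbitrary class such as com.xxx.xxxEntity is not a type it knows how to map. A minimal sketch of two common workarounds, assuming the parsed entity's fields can be copied into a case class (the Message case class and the getId/getPayload getters below are illustrative names, not the real entity API):

// Illustrative case class; the real field names and types depend on the entity.
case class Message(id: String, payload: String)

// Option 1: let the UDF return a case class, so Spark can derive a schema for it.
spark.udf.register("parseJson", (json: String) => {
  val entity = customJsonParser.parseJson(json)
  Message(entity.getId, entity.getPayload)   // hypothetical getters
})

// Option 2: skip the SQL UDF and use a typed map over Dataset[String] instead.
import spark.implicits._
val objDS = messagesDF.as[String].map { json =>
  val entity = customJsonParser.parseJson(json)
  Message(entity.getId, entity.getPayload)
}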

Related

Spark Dataframe to StringType

In PySpark, how do I convert a Dataframe to normal String?
Background:
I'm using PySpark with Kafka, and instead of hard-coding the broker name I have parameterized it. A JSON file holds the broker details; Spark reads this JSON input and assigns the values to variables, but those variables end up as DataFrames wrapping strings rather than plain strings. I'm facing an issue when I pass these DataFrames into the PySpark-Kafka connection details to substitute the values.
Error:
Can only concatenate String (not a DataFrame) to String.
JSON parameter file:
{
  "broker": "https://at.com:8082",
  "topicname": "dev_hello"
}
PySpark code:
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker")
ktopic = parameter.select("topicname")

df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value") \
  .write \
  .format("kafka") \
  .outputMode("append") \
  .option("kafka.bootstrap.servers", "f" + kserver) \
  .option("topic", "josn_data_topic", ktopic) \
  .save()
Please advise on it.
My second query is: how do I pass these Python-based variables to another Scala-based Spark notebook?
Use json.load instead of the Spark JSON reader:
import json

with open("/at/dev_parameter.json") as f:
    parameter = json.load(f)

kserver = parameter["broker"]
ktopic = parameter["topicname"]

df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value") \
  .write \
  .format("kafka") \
  .mode("append") \
  .option("kafka.bootstrap.servers", kserver) \
  .option("topic", ktopic) \
  .save()
If you prefer using the Spark JSON reader, you can do:
parameter = spark.read.option("multiline", "true").json("/at/dev_parameter.json")
kserver = parameter.select("broker").head()[0]
ktopic = parameter.select("topicname").head()[0]
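As for the second query about handing these Python values to a Scala notebook: one approach, assuming both languages run against the same SparkSession (for example mixed-language cells in a Databricks notebook), is to stash the values in the Spark conf from Python and read them back in Scala. The myapp.* key names below are arbitrary examples:

// Python side (for reference):
//   spark.conf.set("myapp.kafka.broker", kserver)
//   spark.conf.set("myapp.kafka.topic", ktopic)
// Scala side:
val kserver = spark.conf.get("myapp.kafka.broker")
val ktopic = spark.conf.get("myapp.kafka.topic")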

Issue reading multiline Kafka messages in Spark

I'm trying to read multiline JSON messages on Spark 2.0.0, but I'm getting _corrupt_record. The code works fine for single-line JSON, and reading the multiline JSON as wholeTextFiles in the REPL also works.
stream.map(record => (record.key(), record.value())).foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    logger.info("----Start of the PersistIPDataRecords Batch processing------")
    // taking only the value part of each record
    val newRDD = rdd.map(x => x._2.toString())
    logger.info("--------------------Before Loop-----------------")
    newRDD.foreach(println)
    import spark.implicits._
    val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(newRDD)
    df.printSchema()
    logger.info("----Converting RDD to Dataframe-----")
  } else logger.info("---------No data received in RDD-----------")
})
ssc.start()
ssc.awaitTermination()
When I try reading it as a file in the REPL it works fine:
scala> val df = spark.read.json(spark.sparkContext.wholeTextFiles("/user/maria_dev/jsondata/employees_multiLine.json").values)
JSON file:
{"empno":"7369", "ename":"SMITH", "designation":"CLERK", "manager":"7902", "hire_date":"12/17/1980", "sal":"800", "deptno":"20"}

Turn string into simple JSON in Scala

I have a string in Scala which, in terms of formatting, is JSON, for example:
{"name":"John", "surname":"Doe"}
But when I generate this value it is initially a string. I need to convert this string into JSON, but I cannot change the output of the source. So how can I do this conversion in Scala? (I cannot use the Play JSON library.)
If you have strings as
{"name":"John", "surname":"Doe"}
and you want to save them to Elasticsearch as mentioned here, then you should use parseRaw instead of parseFull.
parseRaw returns a JSONType, while parseFull returns a Map.
You can do the following:
import scala.util.parsing.json._
val jsonString = "{\"name\":\"John\", \"surname\":\"Doe\"}"
val parsed = JSON.parseRaw(jsonString).get.toString()
And then use the saveJsonToEs API:
sc.makeRDD(Seq(parsed)).saveJsonToEs("spark/json-trips")
Edit:
As @Aivean pointed out, when you already have a JSON string from the source, you don't need to convert it at all. If jsonString is {"name":"John", "surname":"Doe"}, you can just do:
sc.makeRDD(Seq(jsonString)).saveJsonToEs("spark/json-trips")
You can use scala.util.parsing.json to parse a JSON string into Scala data structures (essentially nested Maps), e.g.
scala> import scala.util.parsing.json._
import scala.util.parsing.json._
scala> val json = JSON.parseFull("""{"name":"John", "surname":"Doe"}""")
json: Option[Any] = Some(Map(name -> John, surname -> Doe))
To navigate the parsed JSON:
scala> json match { case Some(jsonMap : Map[String, Any]) => println(jsonMap("name")) case _ => println("json is empty") }
John
A nested JSON example:
scala> val userJsonString = """{"name":"John", "address": { "perm" : "abc", "temp" : "zyx" }}"""
userJsonString: String = {"name":"John", "address": { "perm" : "abc", "temp" : "zyx" }}
scala> val json = JSON.parseFull(userJsonString)
json: Option[Any] = Some(Map(name -> John, address -> Map(perm -> abc, temp -> zyx)))
scala> json.map(_.asInstanceOf[Map[String, Any]]("address")).map(_.asInstanceOf[Map[String, String]]("perm"))
res7: Option[String] = Some(abc)
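If the asInstanceOf chains get unwieldy, a small recursive helper (a sketch, not part of the original answer) can walk the nested Maps that JSON.parseFull produces:

// Walks a nested Map[String, Any] structure along the given key path.
def getPath(json: Any, path: List[String]): Option[Any] = path match {
  case Nil => Some(json)
  case key :: rest => json match {
    case m: Map[_, _] => m.asInstanceOf[Map[String, Any]].get(key).flatMap(getPath(_, rest))
    case _ => None
  }
}

// Example: getPath(json.get, List("address", "perm"))  // Some(abc)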

Java Exception while parsing JSON RDD from Kafka Stream

I am trying to read a JSON string from Kafka using the Spark Streaming library. The code is able to connect to the Kafka broker but fails while decoding the message. The code is inspired by
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson.scala
val kStream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kParams, kTopic).map(_._2)
println("Starting to read from kafka topic:" + topicStr)
kStream.foreachRDD { rdd =>
  if (rdd.toLocalIterator.nonEmpty) {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.read.json(rdd).registerTempTable("mytable")
    if (firstTime) {
      sqlContext.sql("SELECT * FROM mytable").printSchema()
    }
    val df = sqlContext.sql(selectStr)
    df.collect.foreach(println)
    df.rdd.saveAsTextFile(fileName)
    mergeFiles(fileName, firstTime)
    firstTime = false
    println(rdd.name)
  }
}
java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.<init>(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:222)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
The problem was with the version of the Kafka jars used; switching to 0.9.0.0 fixed the issue. The class kafka.message.MessageAndMetadata was introduced in 0.8.2.0.
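If the project is built with sbt, one way to avoid this class of error is to keep the Kafka client artifact aligned with what the spark-streaming-kafka integration was compiled against. The coordinates and versions below are only illustrative; check the Spark documentation for the exact pairing your Spark version expects:

// Illustrative build.sbt fragment (versions are examples, not a prescription):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0",
  "org.apache.kafka" %% "kafka" % "0.9.0.0"
)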

Parsing JSON with Scala lift

I am trying to parse a JSON string with special characters (dots) in its attribute names.
This is what I'm trying:
//Json parser objects
case class SolrDoc(`rdf.about`:String, `dc.title`:List[String],
`dc.creator`:List[String], `dc.dateCopyrighted`:List[Int],
`dc.publisher`:List[String], `dc.type` :String)
case class SolrResponse(numFound:String, start:String, docs: List[SolrDoc])
val req = url("http://localhost:8983/solr/select") <<? Map("q" -> q)
var search_result = http(req ># { json => (json \ "response") })
var response = search_result.extract[SolrResponse]
Even though my JSON string contains values for all the fields, this is the error I'm getting:
Message: net.liftweb.json.MappingException: No usable value for docs
No usable value for rdf$u002Eabout
Did not find value which can be converted into java.lang.String
I suspect it has something to do with the dots in the names, but so far I have not managed to make it work.
Thanks!
This is an extract from my LiftProject.scala file:
"net.databinder" % "dispatch-http_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-http-json_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-lift-json_2.8.1" % "0.8.6"
Dots in names should not be a problem. This is with lift-json 2.4-M4:
scala> import net.liftweb.json._
scala> implicit val formats = DefaultFormats
scala> val json = """ {"first.name":"joe"} """
scala> parse(json).extract[Person]
res0: Person = Person(joe)
Where
case class Person(`first.name`: String)