How to parse JSON formatted Kafka messages in Spark Streaming

I have JSON messages on Kafka like this:
{"id_post":"p1", "message":"blablabla"}
and I want to parse each message and print (or use for further computation) the message element.
With the following code I can print the JSON:
val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, inputGroup, topicMap)
val postStream = kafkaStream.map(_._2)
postStream.foreachRDD((rdd, time) => {
  val count = rdd.count()
  if (count > 0) {
    rdd.foreach(record => {
      println(record)
    })
  }
})
but I can't manage to extract the single elements.
I tried a few JSON parsers, but no luck.
Any ideas?
Update:
here are a few errors I got with different JSON parsers.
This is the code and the output with the circe parser:
val parsed_record = parse(record)
and the output:
14:45:00,676 ERROR Executor:95 - Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at io.circe.jawn.CirceSupportParser$$anon$1$$anon$4.add(CirceSupportParser.scala:36)
at jawn.CharBasedParser$class.parseString(CharBasedParser.scala:90)
at jawn.StringParser.parseString(StringParser.scala:15)
at jawn.Parser.rparse(Parser.scala:397)
at jawn.Parser.parse(Parser.scala:338)
at jawn.SyncParser.parse(SyncParser.scala:24)
at jawn.SupportParser$$anonfun$parseFromString$1.apply(SupportParser.scala:15)
and so on, at the row where I use parse(record).
It looks like it can't access and/or parse the string record.
The same happens if I use lift-json:
at parse(record) the error output is more or less the same:
16:58:20,425 ERROR Executor:95 - Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.NoSuchMethodError: scala.runtime.ObjectRef.create(Ljava/lang/Object;)Lscala/runtime/ObjectRef;
at net.liftweb.json.JsonParser$$anonfun$2.apply(JsonParser.scala:144)
at net.liftweb.json.JsonParser$$anonfun$2.apply(JsonParser.scala:141)
at net.liftweb.json.JsonParser$.parse(JsonParser.scala:80)
at net.liftweb.json.JsonParser$.parse(JsonParser.scala:45)
at net.liftweb.json.package$.parse(package.scala:40)
at SparkConsumer$$anonfun$main$1$$anonfun$apply$1.apply(SparkConsumer.scala:98)
at SparkConsumer$$anonfun$main$1$$anonfun$apply$1.apply(SparkConsumer.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

I solved the issue, so I am writing it here for future reference:
dependencies, dependencies, dependencies!
I chose to use lift-json, but this applies to any JSON parser and/or framework.
The Spark version I am using (v1.4.1) is the one compatible with Scala 2.10; here are the dependencies from pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.4.1</version>
<scope>provided</scope>
</dependency>
and some other libraries. I was using the lift-json version for Scala 2.11 ... and that is WRONG.
So, for future me and anyone reading this topic: keep the Scala version consistent across all dependencies.
In the lift-json case:
<dependency>
<groupId>net.liftweb</groupId>
<artifactId>lift-json_2.10</artifactId>
<version>3.0-M1</version>
</dependency>
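With the Scala versions aligned, a minimal lift-json parsing sketch looks roughly like this (the Post case class just mirrors the question's JSON and is an assumption):
import net.liftweb.json._

case class Post(id_post: String, message: String)

// inside rdd.foreach(record => ...): parse the raw Kafka record and extract the fields
implicit val formats = DefaultFormats
val record = """{"id_post":"p1", "message":"blablabla"}"""
val post = parse(record).extract[Post]
println(post.message) // prints: blablabla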

I had the same problem as you.
However, I solved it by using fastjson.
SBT dependency :
// http://mvnrepository.com/artifact/com.alibaba/fastjson
libraryDependencies += "com.alibaba" % "fastjson" % "1.2.12"
or
Maven dependency :
<!-- http://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.12</version>
</dependency>
You can give it a try. Hope this helps.
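For reference, a minimal parsing sketch with fastjson (the field names simply follow the question's example and are assumptions):
import com.alibaba.fastjson.JSON

// parse the raw Kafka record into a JSONObject and read the fields
val record = """{"id_post":"p1", "message":"blablabla"}"""
val obj = JSON.parseObject(record)
println(obj.getString("id_post") + " -> " + obj.getString("message"))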

Extracting the Data from JSON String in Scala/Apache Spark
import org.apache.spark.rdd.RDD

object JsonData extends Serializable {
  def main(args: Array[String]): Unit = {
    val msg = "{ \"id_post\":\"21\",\"message\":\"blablabla\"}"
    val m1 = msgParse(msg)
    println(m1.id_post)
  }

  case class SomeClass(id_post: String, message: String) extends Serializable

  def msgParse(msg: String): SomeClass = {
    import org.json4s._
    import org.json4s.native.JsonMethods._
    implicit val formats = DefaultFormats
    parse(msg).extract[SomeClass]
  }
}
Below is the Maven dependency:
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-native_2.10</artifactId>
<version>3.3.0</version>
</dependency>
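To use this inside the streaming job from the question, the parser can be mapped over the Kafka stream, roughly like this (a sketch reusing the names from the question):
// Sketch: apply msgParse to each Kafka record from the question's stream
val postStream = kafkaStream.map(_._2)
val posts = postStream.map(JsonData.msgParse)
posts.foreachRDD { rdd =>
  rdd.foreach(p => println(p.message))
}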

Related

NoSuchMethodError while trying to generate JSON strings from a Scala Map using Lift-JSON

I am getting a NoSuchMethodError while trying to generate JSON strings from a Scala Map using Lift-JSON.
I was trying to execute the following program from the Scala Cookbook:
package com.sample

import net.liftweb.json.JsonAST
import net.liftweb.json.JsonDSL._
import net.liftweb.json.Printer.{compact, pretty}

object LiftJsonWithCollections extends App {
  val json = List(1, 2, 3)
  println(compact(JsonAST.render(json)))
  val map = Map("fname" -> "Alvin", "lname" -> "Alexander")
  println(compact(JsonAST.render(map)))
}
This is the error I am getting:
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.$colon$colon.hd$1()Ljava/lang/Object;
at net.liftweb.json.Printer$class.layout$1(JsonAST.scala:597)
at net.liftweb.json.Printer$class.compact(JsonAST.scala:605)
at net.liftweb.json.Printer$.compact(JsonAST.scala:583)
at net.liftweb.json.Printer$class.compact(JsonAST.scala:590)
at net.liftweb.json.Printer$.compact(JsonAST.scala:583)
at com.sample.LiftJsonWithCollections$.delayedEndpoint$com$sample$LiftJsonWithCollections$1(LiftJsonWIthCollection.scala:10)
at com.sample.LiftJsonWithCollections$delayedInit$body.apply(LiftJsonWIthCollection.scala:8)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.sample.LiftJsonWithCollections$.main(LiftJsonWIthCollection.scala:8)
at com.sample.LiftJsonWithCollections.main(LiftJsonWIthCollection.scala)
I can't resolve it.
I have added the following dependencies in the pom:
<dependency>
<groupId>net.liftweb</groupId>
<artifactId>lift-json_2.11</artifactId>
<version>3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.json4s/json4s-jackson -->
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-jackson_2.11</artifactId>
<version>3.2.10</version>
</dependency>
When I remove the JsonAST prefix and write the following code, it works:
object LiftJsonWithCollections extends App {
  implicit val formats = DefaultFormats
  val json = List(1, 2, 3)
  println(compact(render(json)))
  val map = Map("fname" -> "Alvin", "lname" -> "Alexander")
  println(compact(render(map)))
}
However, I can find a render method in the JsonAST API as well. How does that work?

Encoding/decoding shapeless records with circe

Upgrading circe from 0.4.1 to 0.7.0 broke the following code:
import shapeless._
import syntax.singleton._
import io.circe.generic.auto._
.run[Record.`'transaction_id -> Int`.T](transport)
def run[A](transport: Json => Future[Json])(implicit decoder: Decoder[A], exec: ExecutionContext): Future[A]
With the following error:
could not find implicit value for parameter decoder: io.circe.Decoder[shapeless.::[Int with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("transaction_id")],Int],shapeless.HNil]]
[error] .run[Record.`'transaction_id -> Int`.T](transport)
[error] ^
Am I missing some import here or are these encoders/decoders not available in circe anymore?
Instances for Shapeless's hlists, records, etc. were moved to a separate circe-shapes module in the circe 0.6.0 release. If you add this module to your build, the following should just work:
import io.circe.jawn.decode, io.circe.shapes._
import shapeless._, record.Record, syntax.singleton._
val doc = """{ "transaction_id": 1 }"""
val res = decode[Record.`'transaction_id -> Int`.T](doc)
The motivation for moving these instances was that the improved generic derivation introduced in 0.6 meant that they were no longer necessary, and keeping them out of implicit scope when they're not needed is both cleaner and potentially supports faster compile times. The new circe-shapes module also includes features that were not available in circe-generic, such as instances for coproducts.
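For reference, adding the module in sbt looks roughly like this (the version is an assumption; use the one matching your circe release):
// build.sbt (sketch): bring the shapeless instances back into implicit scope
libraryDependencies += "io.circe" %% "circe-shapes" % "0.7.0"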

Scala/Spark: NoClassDefFoundError: net/liftweb/json/Formats

I am trying to create a JSON String from a Scala Object as described here.
I have the following code:
import scala.collection.mutable._
import net.liftweb.json._
import net.liftweb.json.Serialization.write

case class Person(name: String, address: Address)
case class Address(city: String, state: String)

object LiftJsonTest extends App {
  val p = Person("Alvin Alexander", Address("Talkeetna", "AK"))
  // create a JSON string from the Person, then print it
  implicit val formats = DefaultFormats
  val jsonString = write(p)
  println(jsonString)
}
My build.sbt file contains the following:
libraryDependencies += "net.liftweb" %% "lift-json" % "2.5+"
When I build with sbt package, it is a success.
However, when I try to run it as a Spark job, like this:
spark-submit \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0,net.liftweb:lift-json:2.5+ \
--class "com.foo.MyClass" \
--master local[4] \
target/scala-2.10/my-app_2.10-0.0.1.jar
I get this error:
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: net.liftweb#lift-json;2.5+: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1068)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What am I doing wrong here? Is net.liftweb:lift-json:2.5+ in my packages argument incorrect? Do I need to add a resolver in build.sbt?
From the spark-submit documentation: "Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages."
2.5+ in your build.sbt is Ivy version-matcher syntax, not an actual artifact version, which is what Maven coordinates need. spark-submit apparently doesn't use Ivy for resolution (and I think it would be surprising if it did; your application could suddenly stop working because a new dependency version was published). So you need to find out what version 2.5+ resolves to in your case, e.g. using https://github.com/jrudolph/sbt-dependency-graph (or by looking in show dependencyClasspath).
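For example, one could pin an exact released version in build.sbt and pass that same coordinate to --packages (2.5.1 here is an assumption; substitute whatever version your build actually resolves to):
// build.sbt (sketch): replace the Ivy "2.5+" matcher with an explicit version
libraryDependencies += "net.liftweb" %% "lift-json" % "2.5.1"
// and pass the matching coordinate on the command line, e.g. net.liftweb:lift-json_2.10:2.5.1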

Java Exception while parsing JSON RDD from Kafka Stream

I am trying to read a JSON string from Kafka using the Spark Streaming library. The code is able to connect to the Kafka broker but fails while decoding the message. The code is inspired by
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson.scala
val kStream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kParams, kTopic).map(_._2)
println("Starting to read from kafka topic:" + topicStr)
kStream.foreachRDD { rdd =>
  if (rdd.toLocalIterator.nonEmpty) {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.read.json(rdd).registerTempTable("mytable")
    if (firstTime) {
      sqlContext.sql("SELECT * FROM mytable").printSchema()
    }
    val df = sqlContext.sql(selectStr)
    df.collect.foreach(println)
    df.rdd.saveAsTextFile(fileName)
    mergeFiles(fileName, firstTime)
    firstTime = false
    println(rdd.name)
  }
}
java.lang.NoSuchMethodError: kafka.message.MessageAndMetadata.(Ljava/lang/String;ILkafka/message/Message;JLkafka/serializer/Decoder;Lkafka/serializer/Decoder;)V
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:222)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
The problem was with the version of the Kafka jars being used; using 0.9.0.0 fixed the issue. The class kafka.message.MessageAndMetadata was introduced in 0.8.2.0.
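For example, in sbt the client can be pinned roughly like this (a sketch; the exact artifact depends on your Spark and Scala versions):
// build.sbt (sketch): align the Kafka client version with what spark-streaming-kafka expects
libraryDependencies += "org.apache.kafka" %% "kafka" % "0.9.0.0"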

I am reading JSON data from Kafka and parsing the data using Spark, but I end up with a JSON parser issue

I am reading JSON data from Kafka and parsing the data using Spark, but I end up with a JSON parser issue. The code is shown below:
val Array(zkQuorum, groupId, topics, numThreads) = args
val conf = new SparkConf()
  .setAppName("KafkaAggregation")
// create sparkContext
val sc = new SparkContext(conf)
// streaming context
val ssc = new StreamingContext(conf, Seconds(1))
// ssc.checkpoint("hdfs://localhost:8020/usr/tmp/data")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap).map((_._2))
val lineJson = lines.map(JSON.parseFull(_))
  .map(_.get.asInstanceOf[scala.collection.immutable.Map[String, Any]])
Error details:
error: not found: value JSON
[INFO] val lineJson = lines.map(JSON.parseFull(_))
Which Maven dependency should I use to sort out the error?
I think you are looking for this:
import scala.util.parsing.json._
And add the Maven dependency:
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-parser-combinators</artifactId>
<version>2.11.0-M4</version>
</dependency>
https://mvnrepository.com/artifact/org.scala-lang/scala-parser-combinators/2.11.0-M4
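A minimal usage sketch with scala.util.parsing.json (the field names follow the question's JSON and are assumptions):
import scala.util.parsing.json.JSON

// parse a record and read one field; parseFull returns Option[Any]
val record = """{"id_post":"p1", "message":"blablabla"}"""
JSON.parseFull(record) match {
  case Some(fields: Map[String, Any] @unchecked) => println(fields.get("message")) // Some(blablabla)
  case _ => println("Could not parse: " + record)
}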