Parse JSON for Spark Structured Streaming - json

I am implementing Spark Structured Streaming, and for my use-case I have to specify the starting offsets.
I have the offset values in the form of an Array[String]:
{"topic":"test","partition":0,"starting_offset":123}
{"topic":"test","partition":1,"starting_offset":456}
I would like to convert it to the below programmatically, so that I can pass it to Spark.
{"test":{"0":123,"1":456}}
Note: This is just a sample; I keep getting different offset ranges, so I cannot hardcode it.

If array is the variable containing the list you describe, then (in Python):
>>> [{d['topic']: [d['partition'], d['starting_offset']]} for d in array]
[{'test': [0, 123]}, {'test': [1, 456]}]

scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> val topicAsRawStr: Array[String] = Array(
         """{"topic":"test","partition":0,"starting_offset":123}""",
         """{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(rawText => {
         val json = parse(rawText)
         val topicName = json \ "topic"                 // Extract topic value
         val offsetForTopic = json \ "starting_offset"  // Extract starting_offset
         topicName -> offsetForTopic
       })
scala> // Aggregate offsets for each topic
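If you want the exact {"test":{"0":123,"1":456}} shape, here is a minimal end-to-end sketch (plain Scala rather than the REPL), assuming json4s as above; OffsetRecord and toStartingOffsetsJson are invented names:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization

object StartingOffsets {
  implicit val formats: Formats = DefaultFormats

  // Mirrors one input line, e.g. {"topic":"test","partition":0,"starting_offset":123}
  case class OffsetRecord(topic: String, partition: Int, starting_offset: Long)

  def toStartingOffsetsJson(lines: Array[String]): String = {
    val records = lines.map(line => parse(line).extract[OffsetRecord])
    // Group by topic and map each partition id (as a string) to its offset,
    // producing e.g. {"test":{"0":123,"1":456}}
    val byTopic: Map[String, Map[String, Long]] =
      records.groupBy(_.topic).map { case (topic, recs) =>
        topic -> recs.map(r => r.partition.toString -> r.starting_offset).toMap
      }
    Serialization.write(byTopic)
  }
}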
You can also use the spark.sparkContext.parallelize API.
scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> implicit val formats: Formats = DefaultFormats
scala> case class KafkaTopic(topic: String, partition: Int, starting_offset: Long)
scala> val spark: SparkSession = ???
scala> val topicAsRawStr: Array[String] = Array(
         """{"topic":"test","partition":0,"starting_offset":123}""",
         """{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(line => parse(line).extract[KafkaTopic])
scala> val kafkaTopicRDD = spark.sparkContext.parallelize(topicAsJSONs)
scala> // Aggregate offsets for each topic
scala> val aggregatedOffsetsByTopic = kafkaTopicRDD
         .map(kT => kT.topic -> kT.starting_offset)
         .groupByKey()
         .mapValues(_.toSet)
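Once the offsets are rendered as a {"topic":{"partition":offset}} string, they can be passed to the Kafka source through its startingOffsets option. A minimal sketch, assuming spark is the active SparkSession, startingOffsetsJson holds that string (for example, built as in the sketch above), and the broker address is a placeholder:

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker address
  .option("subscribe", "test")
  .option("startingOffsets", startingOffsetsJson)       // e.g. {"test":{"0":123,"1":456}}
  .load()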

Related

Compare key and values in json string using scala

I want to compare the first JSON string with the other two JSON strings.
First the keys should match. If they match, then compare the nested keys and values.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"
The example above should throw an error for the key mismatch.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of2 = "{\"keyA\":{\"1\":11,\"0\":200}}"
This should return true, as the keys match and the values of sub-keys 1 and 0 in the first JSON are greater than those in the second and third JSONs. The numbers are Long values.
Please help.
Below is my attempt.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"
def OffsetComparator(json1: String, json2: String, json3:String): Boolean = {
val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
val jsonObj1 = mapper.readValue(json1, classOf[Map[String, Map[String, Long]]])
val jsonObj2 = mapper.readValue(json2, classOf[Map[String, Map[String, Long]]])
val jsonObj3 = mapper.readValue(json3, classOf[Map[String, Map[String, Long]]])
//Trying to get the key and compare first
val mapA = jsonObj1.keySet.foreach(i=>jsonObj1.keySet(i).toString)
val mapB = jsonObj2.keySet
val mapC = jsonObj3.keySet
println( (jsonObj1.keySet == jsonObj3.keySet) )
if (mapA.keySet != mapB.keySet || mapA.keySet != mapC.keySet) throw new Exception("partitions mismatch")
mapA.keys.forall(k => (mapA(k).asInstanceOf[Long] > mapB(k).asInstanceOf[Long] && mapA(k).asInstanceOf[Long] > mapC(k).asInstanceOf[Long]))
// getting error :java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long when i am casting as Long.Not su
}
println(OffsetComparator(of1, of2,of3))
}
You can try https://github.com/gnieh/diffson. It's available for circe, spray-json and play-json.
Your example with Circe:
import diffson._
import diffson.lcs._
import diffson.jsonpatch.lcsdiff._
import io.circe._
import diffson.circe._
import diffson.jsonpatch._
import io.circe.parser._
val decoder = Decoder[JsonPatch[Json]]
val encoder = Encoder[JsonPatch[Json]]
implicit val lcs = new Patience[Json]
val json1 = parse(of1)
val json2 = parse(of2)
val patch =
  for {
    json1 <- json1
    json2 <- json2
  } yield diff(json1, json2)
print(patch)
That gives:
Right(JsonPatch(List(Replace(Chain(Left(keyA), Left(0)),201,None), Replace(Chain(Left(keyA), Left(1)),12,None))))
Take a look at https://index.scala-lang.org/gnieh/diffson/diffson-circe/4.0.3?target=_2.13 to see how it works.
For Circe, include the dependency:
"org.gnieh" %% "diffson-circe" % "4.0.3"

Extract a map from a Json interprets all numbers as BigInt

I've extracted a map from JSON, and this works so far. Since I don't know before parsing which fields are in the JSON, I've been using a Map[String, Any]. Every field consisting only of digits is interpreted as a BigInt, which I don't want.
My code:
implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats
json.extract[Map[String, Any]]
Any way to implicitly make the numbers interpreted as Int or Long?
You did not specify how the JSON value is created. If you parse it from a String, then the useBigIntForLong flag does the trick:
import org.json4s.DefaultFormats
import org.json4s.JsonAST._
import org.json4s.native.JsonMethods
object Main {
  def main(args: Array[String]): Unit = {
    implicit val formats: DefaultFormats = DefaultFormats
    val parsedJson = JsonMethods.parse(""" { "a" : 42} """, useBigIntForLong = false)
    parsedJson.extract[Map[String, Any]].foreach {
      case (name, value) => println(s"$name = $value (${value.getClass})")
    }
  }
}
Output:
a = 42 (class java.lang.Long)
If you construct the JSON value programmatically, then you can choose between BigInt and Long directly:
val constructedJson = JObject(
  "alwaysBigInt" -> JInt(42),
  "alwaysLong" -> JLong(55),
)
constructedJson.extract[Map[String, Any]].foreach {
  case (name, value) => println(s"$name = $value (${value.getClass})")
}
Output:
alwaysBigInt = 42 (class scala.math.BigInt)
alwaysLong = 55 (class java.lang.Long)

Generic Play json Formatter

I have a Scala application with a case class like this:
case class SR(
  systemId: Option[String] = None,
  x: Map[Timestamp, CaseClass1] = Map.empty,
  y: Map[Timestamp, CaseClass2] = Map.empty,
  z: Map[Timestamp, CaseClass3] = Map.empty
)
Now I have to provide implicit read and write JSON formats for properties x, y, z of the SR case class, like this:
implicit val mapCMPFormat = new Format[Map[Timestamp, CaseClass1]] {
  def writes(obj: Map[Timestamp, CaseClass1]): JsValue =
    JsArray(obj.values.toSeq.map(Json.toJson(_)))
  def reads(jv: JsValue): JsResult[Map[Timestamp, CaseClass1]] = jv.validate[scala.collection.Seq[CaseClass1]] match {
    case JsSuccess(objs, path) => JsSuccess(objs.map(obj => obj.dataDate.get -> obj).toMap, path)
    case err: JsError => err
  }
}
And so on similarly for y and z. In the future I will be adding many more properties like x, y, z to the SR case class and will then need to provide the formatters.
So can I get some generic formatter that will take care of all these types?
To my knowledge, a simple way to achieve this does not exist. However, creating a "default" format for each object should not be hard, something like:
case class VehicleColorForAdd(
  name: String,
  rgb: String
)

object VehicleColorForAdd {
  implicit val jsonFormat: Format[VehicleColorForAdd] = Json.format[VehicleColorForAdd]
}
This way you have access to the implicit by simply using the object, so you can have other objects that contain this object with no problem:
case class BiggerModel(
  vehicleColorForAdd: VehicleColorForAdd
)

object BiggerModel {
  implicit val jsonFormat: Format[BiggerModel] = Json.format[BiggerModel]
}
Sadly, you need to do this for each class type, but you can "extend" the Play converters with your own. For example, these are some of my default readers:
package common.json
import core.order.Order
import org.joda.time.{ DateTime, LocalDateTime }
import org.joda.time.format.DateTimeFormat
import core.promotion.{ DailySchedule, Period }
import play.api.libs.functional.syntax._
import play.api.libs.json.Reads._
import play.api.libs.json._
import play.api.libs.json.{ JsError, JsPath, JsSuccess, Reads }
import scala.language.implicitConversions
/**
 * General JSon readers and transformations.
 */
object JsonReaders {
  val dateTimeFormat = "yyyy-MM-dd HH:mm:ss"

  class JsPathHelper(val path: JsPath) {
    def readTrimmedString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.trim)
    def readUpperString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.toUpperCase)
    def readNullableTrimmedString(implicit r: Reads[String]): Reads[Option[String]] = Reads.nullable[String](path)(r).map(_.map(_.trim))
  }

  implicit val localDateTimeReader: Reads[LocalDateTime] = Reads[LocalDateTime]((js: JsValue) =>
    js.validate[String].map[LocalDateTime](dtString =>
      LocalDateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  val localDateTimeWriter: Writes[LocalDateTime] = new Writes[LocalDateTime] {
    def writes(d: LocalDateTime): JsValue = JsString(d.toString(dateTimeFormat))
  }

  implicit val localDateTimeFormat: Format[LocalDateTime] = Format(localDateTimeReader, localDateTimeWriter)

  implicit val dateTimeReader: Reads[DateTime] = Reads[DateTime]((js: JsValue) =>
    js.validate[String].map[DateTime](dtString =>
      DateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  implicit def toJsPathHelper(path: JsPath): JsPathHelper = new JsPathHelper(path)

  val defaultStringMax: Reads[String] = maxLength[String](255)
  val defaultStringMinMax: Reads[String] = minLength[String](1) andKeep defaultStringMax
  val rgbRegex: Reads[String] = pattern("""^#([\da-fA-F]{2})([\da-fA-F]{2})([\da-fA-F]{2})$""".r, "error.invalidRGBPattern")
  val plateRegex: Reads[String] = pattern("""^[\d\a-zA-Z]*$""".r, "error.invalidPlatePattern")
  val minOnlyWordsRegex: Reads[String] = minLength[String](2) keepAnd onlyWordsRegex
  val positiveInt: Reads[Int] = min[Int](1)
  val zeroPositiveInt: Reads[Int] = min[Int](0)
  val zeroPositiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](0)
  val positiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](1)

  def validLocalDatePeriod()(implicit reads: Reads[Period]) =
    Reads[Period](js => reads.reads(js).flatMap { o =>
      if (o.startPeriod isAfter o.endPeriod)
        JsError("error.startPeriodAfterEndPeriod")
      else
        JsSuccess(o)
    })

  def validLocalTimePeriod()(implicit reads: Reads[DailySchedule]) =
    Reads[DailySchedule](js => reads.reads(js).flatMap { o =>
      if (o.dailyStart isAfter o.dailyEnd)
        JsError("error.dailyStartAfterDailyEnd")
      else
        JsSuccess(o)
    })
}
Then, you only need to import this object to have access to all these implicit converters:
package common.forms

import common.json.JsonReaders._
import play.api.libs.json._

/**
 * Form to add a model with only one string field.
 */
object SimpleCatalogAdd {
  case class Data(
    name: String
  )

  implicit val dataReads: Reads[Data] = (__ \ "name").readTrimmedString(defaultStringMinMax).map(Data.apply)
}
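For what it's worth, here is a hedged sketch of a generic factory for the pattern in the question: it builds a Format[Map[Timestamp, T]] for any T that has a Format and a function extracting its Timestamp key. It assumes Timestamp is java.sql.Timestamp, and mapByTimestampFormat/keyOf are invented names, not part of the answer above:

import java.sql.Timestamp
import play.api.libs.json._

def mapByTimestampFormat[T](keyOf: T => Timestamp)(implicit fmt: Format[T]): Format[Map[Timestamp, T]] =
  new Format[Map[Timestamp, T]] {
    // Serialize only the values, as in the question's hand-written format.
    def writes(obj: Map[Timestamp, T]): JsValue = Json.toJson(obj.values.toSeq)
    // Read a JSON array of T and rebuild the map keyed by each value's timestamp.
    def reads(jv: JsValue): JsResult[Map[Timestamp, T]] =
      jv.validate[Seq[T]].map(_.map(t => keyOf(t) -> t).toMap)
  }

// Hypothetical usage, assuming CaseClass1 has a Format and a dataDate: Option[Timestamp]
// field as in the question:
// implicit val mapCMPFormat: Format[Map[Timestamp, CaseClass1]] =
//   mapByTimestampFormat[CaseClass1](_.dataDate.get)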

How to parse json with arbitrary schema, update/create one field and write it back as json(Scala)

How does one update/create a field in a JSON object with an arbitrary schema and write it back as JSON in Scala?
I tried spray-json with something like this:
import spray.json._
import DefaultJsonProtocol._
val jsonAst = """{"anyfield":"1234", "sought_optional_field":5.0}""".parse
val newValue = jsonAst.asJsObject.fields.getOrElse("sought_optional_field", 1)
val newMap = jsonAst.asJsObject.fields + ("sought_optional_field" -> newValue)
JSONObject(newMap).toJson
but it gives a weird result: "{"anyfield"[ : "1234", "sought_optional_field" : ]1}
You were almost there:
import spray.json._
import DefaultJsonProtocol._
def changeField(json: String) = {
  val jsonAst = JsonParser(json)
  val map = jsonAst.asJsObject.fields
  val sought = map.getOrElse("sought_optional_field", 1.toJson)
  map.updated("sought_optional_field", sought).toJson
}
val jsonA = """{"anyfield":"1234", "sought_optional_field":5.0}"""
val jsonB = """{"anyfield":"1234"}"""
changeField(jsonA)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":5.0}
changeField(jsonB)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":1}
Using Argonaut:
import argonaut._, Argonaut._
def changeField2(json: String) =
  json.parseOption.map(parsed =>
    parsed.withObject(o =>
      o + ("sought_optional_field", o("sought_optional_field").getOrElse(jNumber(1)))
    )
  )
changeField2(jsonA).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":5})
changeField2(jsonB).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":1})

Deserializer for List[T] types

I'm trying to deserialize a JsArray into a List[T] in a Play Framework application using Scala. After some research I found this method, which is supposed to do the needed work:
/**
 * Deserializer for List[T] types.
 */
implicit def listReads[T](implicit fmt: Reads[T]): Reads[List[T]] = new Reads[List[T]] {
  def reads(json: JsValue) = json match {
    case JsArray(ts) => ts.map(t => fromJson(t)(fmt)).toList
    case _ => throw new RuntimeException("List expected")
  }
}
The problem is that I don't know how to use it. Any help is welcome.
Here's a quick example:
scala> import play.api.libs.json._
import play.api.libs.json._
scala> Json.toJson(List(1, 2, 3)).as[List[Int]]
res0: List[Int] = List(1, 2, 3)
And if you have a custom type with a Format instance:
case class Foo(i: Int, x: String)

implicit object fooFormat extends Format[Foo] {
  def reads(json: JsValue) = Foo(
    (json \ "i").as[Int],
    (json \ "x").as[String]
  )
  def writes(foo: Foo) = JsObject(Seq(
    "i" -> JsNumber(foo.i),
    "x" -> JsString(foo.x)
  ))
}
It still works:
scala> val foos = Foo(1, "a") :: Foo(2, "bb") :: Nil
foos: List[Foo] = List(Foo(1,a), Foo(2,bb))
scala> val json = Json.toJson(foos)
json: play.api.libs.json.JsValue = [{"i":1,"x":"a"},{"i":2,"x":"bb"}]
scala> json.as[List[Foo]]
res1: List[Foo] = List(Foo(1,a), Foo(2,bb))
This approach would also work if your custom type had an xs: List[String] member, for example: you'd just use (json \ "xs").as[List[String]] in your reads method and Json.toJson(foo.xs) in your writes.
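As a quick illustration of that last point, here is a small sketch (Bar and its fields are invented for the example); note that in current Play JSON versions reads must return a JsResult, so the result is wrapped in JsSuccess here:

case class Bar(i: Int, xs: List[String])

implicit object barFormat extends Format[Bar] {
  def reads(json: JsValue) = JsSuccess(Bar(
    (json \ "i").as[Int],
    (json \ "xs").as[List[String]]
  ))
  def writes(bar: Bar) = JsObject(Seq(
    "i" -> JsNumber(bar.i),
    "xs" -> Json.toJson(bar.xs)
  ))
}

// Json.parse("""{"i":1,"xs":["a","b"]}""").as[Bar]  // Bar(1, List(a, b))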