toDF() not supported - json

I am very new to Scala. I am trying to convert an Iterable[Dataset[Row]] to a DataFrame. It's not working for me. Here is the code:
def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
  val sparkSession: SparkSession = SparkSession.builder().getOrCreate()
  import sparkSession.implicits._
  val jsonSeq = Seq(input)
  val jsonRDD = sparkSession.sparkContext.parallelize(jsonSeq)
  val jsonDF = jsonRDD.toDF()
}

You can first convert to a Dataset and then convert to a DataFrame:
def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  import spark.implicits._
  val jsonSeq = Seq(input)
  val jsonRDD = spark.sparkContext.parallelize(jsonSeq)
  val jsonDF = spark.createDataset(jsonRDD).toDF()
}
If you don't like converting to a Dataset, you can specify the type of the RDD:
val jsonRDD: RDD[Iterable[Dataset[Row]]] = spark.sparkContext.parallelize(jsonSeq)
val jsonDF = spark.createDataFrame[Iterable[Dataset[Row]]](jsonRDD).toDF()
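If the goal is simply to merge the datasets held in input into a single DataFrame, rather than to parallelize the Iterable itself, a union may be all that is needed. A minimal sketch, assuming every Dataset[Row] in input shares the same schema:
import org.apache.spark.sql.{Dataset, Row, SparkSession}

def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  // union requires all datasets to have matching schemas;
  // use reduceOption(_ union _).getOrElse(spark.emptyDataFrame) if input can be empty
  input.reduce(_ union _)
}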

Compare keys and values in JSON strings using Scala

I want to compare the first JSON string with the other two JSON strings.
First the keys should match. If they match, then compare the nested keys and values.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"
This should throw an error because the keys do not match.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyA\":{\"1\":11,\"0\":200}}"
This should return true, as the keys match and the values under sub-keys 1 and 0 in the first JSON are greater than the corresponding values in the second and third JSON. The numbers are Long values.
Please help.
Below is my attempt:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"

def OffsetComparator(json1: String, json2: String, json3: String): Boolean = {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  val jsonObj1 = mapper.readValue(json1, classOf[Map[String, Map[String, Long]]])
  val jsonObj2 = mapper.readValue(json2, classOf[Map[String, Map[String, Long]]])
  val jsonObj3 = mapper.readValue(json3, classOf[Map[String, Map[String, Long]]])
  // Trying to get the key and compare first
  val mapA = jsonObj1.keySet.foreach(i => jsonObj1.keySet(i).toString)
  val mapB = jsonObj2.keySet
  val mapC = jsonObj3.keySet
  println(jsonObj1.keySet == jsonObj3.keySet)
  if (mapA.keySet != mapB.keySet || mapA.keySet != mapC.keySet) throw new Exception("partitions mismatch")
  mapA.keys.forall(k => (mapA(k).asInstanceOf[Long] > mapB(k).asInstanceOf[Long] && mapA(k).asInstanceOf[Long] > mapC(k).asInstanceOf[Long]))
  // getting error: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long when i am casting as Long. Not su
}
println(OffsetComparator(of1, of2, of3))
You can try https://github.com/gnieh/diffson. It's available for circe, spray-json and play-json.
Your example with Circe:
import diffson._
import diffson.lcs._
import diffson.jsonpatch.lcsdiff._
import io.circe._
import diffson.circe._
import diffson.jsonpatch._
import io.circe.parser._
val decoder = Decoder[JsonPatch[Json]]
val encoder = Encoder[JsonPatch[Json]]
implicit val lcs = new Patience[Json]
val json1 = parse(of1)
val json2 = parse(of2)
val patch =
  for {
    json1 <- json1
    json2 <- json2
  } yield diff(json1, json2)
print(patch)
That gives:
Right(JsonPatch(List(Replace(Chain(Left(keyA), Left(0)),201,None), Replace(Chain(Left(keyA), Left(1)),12,None))))
Take a look at https://index.scala-lang.org/gnieh/diffson/diffson-circe/4.0.3?target=_2.13 to see how it works.
For Circe, include the dependency:
"org.gnieh" %% "diffson-circe" % "4.0.3"

JsonProtocol NoClassDefFoundError

I try to save a Dataset into an Elasticsearch index every day (scheduled with Oozie), but sometimes I get the error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.util.JsonProtocol and the job fails immediately. I don't know why this error appears.
Code:
private def readSource1()(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  val sourceName = "dictionary.source1"
  val plantsPath: String = config.getString("sources." + sourceName + ".path")
  spark.read
    .option("delimiter", ";")
    .option("header", "true")
    .csv(plantsPath)
    .select('id as "sourceId", 'assembly_site_id)
}

private def readSource2()(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  val source2: SourceIO = SourceManager(config)("source2")
  (startDate, endDate) match {
    case (Some(sd), Some(ed)) ⇒ source2.loadDf()
      .where('assemblyEndDate.between(Date.valueOf(sd), Date.valueOf(ed)) ||
        'tctDate.between(Date.valueOf(sd), Date.valueOf(ed)))
    case _ ⇒ source2.loadDf()
  }
}

def saveSourceToEs(implicit sparkSession: SparkSession): Unit = {
  val source1: DataFrame = readSource1()
  val source2: DataFrame = readSource2()
  val source: Dataset[Source] = buildSource(this.getSource(source1, source2))
  source.saveToEs(s"source_${createDateString()}/_doc")
}

object SourceIndexer extends SparkApp with Configurable with Logging {
  val config: Config = ConfigFactory.load()

  def apply(
    sourceID: Option[String] = None,
    startDate: Option[LocalDate] = None,
    endDate: Option[LocalDate] = None
  ): SourceIndexer = {
    new SourceIndexer(config, sourceID, startDate, endDate)
  }

  def main(args: Array[String]): Unit = {
    try {
      val bootConfig = BootConfig.parseSourceIndexer(args)
      this.apply(bootConfig.sourceID, bootConfig.startDate, bootConfig.endDate)
        .saveSourceToEs(spark)
    } finally {
      spark.sparkContext.stop()
    }
  }
}
Thanks for your help.

Parse JSON for Spark Structured Streaming

I am implementing Spark Structured Streaming, and for my use case I have to specify the starting offsets.
I have the offset values in the form of an Array[String]:
{"topic":"test","partition":0,"starting_offset":123}
{"topic":"test","partition":1,"starting_offset":456}
I would like to convert it to the below programmatically, so that I can pass it to Spark.
{"test":{"0":123,"1":456}}
Note: this is just a sample; I keep getting different offset ranges, so I cannot hardcode it.
If array is the variable containing the list you describe, then:
>>> [{d['topic']: [d['partition'], d['starting_offset']]} for d in array]
[{'test': [0, 123]}, {'test': [1, 456]}]
scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> val topicAsRawStr: Array[String] = Array(
"""{"topic":"test","partition":0,"starting_offset":123}""",
"""{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(rawText => {
val json = parse(rawText)
val topicName = json \ "topic" // Extract topic value
val offsetForTopic = json \ "starting_offset" // Extract starting_offset
topicName -> offsetForTopic
})
scala> // Aggregate offsets for each topic
You can also use the spark.sparkContext.parallelize API.
scala> case class KafkaTopic(topicName: String, partitionId: Int, starting_offset: Int)
scala> val spark: SparkSession = ???
scala> val topicAsRawStr: Array[String] = Array(
"""{"topic":"test","partition":0,"starting_offset":123}""",
"""{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(line => json.parse(line).extract[KafkaTopic])
scala> val kafkaTopicDS = spark.sparkContext.parallelize(topicAsJSONs)
scala> val aggregatedOffsetsByTopic = kafkaTopicDS
.groupByKey("topic")
.mapGroups {
case (topicName, kafkaTopics) =>
val offsets = kafkaTopics.flatMap(kT => kT.starting_offset)
(topicName -> offsets.toSet)
}
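If the end goal is the single JSON string {"test":{"0":123,"1":456}} expected by the Kafka source's startingOffsets option, the grouping and rendering can also be done entirely with json4s. A minimal sketch, assuming json4s-jackson is on the classpath; the TopicOffset case class is an illustrative helper:
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats: Formats = DefaultFormats

case class TopicOffset(topic: String, partition: Int, starting_offset: Long)

val raw: Array[String] = Array(
  """{"topic":"test","partition":0,"starting_offset":123}""",
  """{"topic":"test","partition":1,"starting_offset":456}""")

val offsets = raw.map(line => parse(line).extract[TopicOffset])

// Group by topic and build {"topic":{"partition":offset, ...}}.
val json: JValue = JObject(
  offsets.groupBy(_.topic).toList.map { case (topic, ts) =>
    topic -> JObject(ts.toList.map(t => t.partition.toString -> JInt(BigInt(t.starting_offset))))
  })

val startingOffsets = compact(render(json))
// startingOffsets: String = {"test":{"0":123,"1":456}}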

Generic Play JSON Formatter

I have a Scala application with a case class like:
case class SR(
  systemId: Option[String] = None,
  x: Map[Timestamp, CaseClass1] = Map.empty,
  y: Map[Timestamp, CaseClass2] = Map.empty,
  z: Map[Timestamp, CaseClass3] = Map.empty
)
Now I have to provide an implicit read and write JSON format for properties x, y, z of the SR case class, like:
implicit val mapCMPFormat = new Format[Map[Timestamp, CaseClass1]] {
  def writes(obj: Map[Timestamp, CaseClass1]): JsValue =
    JsArray(obj.values.toSeq.map(Json.toJson(_)))
  def reads(jv: JsValue): JsResult[Map[Timestamp, CaseClass1]] = jv.validate[scala.collection.Seq[CaseClass1]] match {
    case JsSuccess(objs, path) => JsSuccess(objs.map(obj => obj.dataDate.get -> obj).toMap, path)
    case err: JsError => err
  }
}
And so on similarly for y and z; in future I will be adding many more properties like x, y, z to the SR case class and will then need to provide the formatters.
So can I have some generic formatter that will take care of all the types?
To my knowledge, a simple way to achieve this does not exist; however, creating a "default" format for each object is not hard to do, something like:
case class VehicleColorForAdd(
  name: String,
  rgb: String
)

object VehicleColorForAdd {
  implicit val jsonFormat: Format[VehicleColorForAdd] = Json.format[VehicleColorForAdd]
}
This way you have access to the implicit by simply using the object, so you can have other objects that contain this object with no problem:
case class BiggerModel(
  vehicleColorForAdd: VehicleColorForAdd
)

object BiggerModel {
  implicit val jsonFormat: Format[BiggerModel] = Json.format[BiggerModel]
}
Sadly, you need to do this for each class type, but you can "extend" the Play converters with your own. For example, these are some of my default readers:
package common.json

import core.order.Order
import org.joda.time.{ DateTime, LocalDateTime }
import org.joda.time.format.DateTimeFormat
import core.promotion.{ DailySchedule, Period }
import play.api.libs.functional.syntax._
import play.api.libs.json.Reads._
import play.api.libs.json._
import play.api.libs.json.{ JsError, JsPath, JsSuccess, Reads }
import scala.language.implicitConversions

/**
 * General JSON readers and transformations.
 */
object JsonReaders {
  val dateTimeFormat = "yyyy-MM-dd HH:mm:ss"

  class JsPathHelper(val path: JsPath) {
    def readTrimmedString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.trim)
    def readUpperString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.toUpperCase)
    def readNullableTrimmedString(implicit r: Reads[String]): Reads[Option[String]] = Reads.nullable[String](path)(r).map(_.map(_.trim))
  }

  implicit val localDateTimeReader: Reads[LocalDateTime] = Reads[LocalDateTime]((js: JsValue) =>
    js.validate[String].map[LocalDateTime](dtString =>
      LocalDateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  val localDateTimeWriter: Writes[LocalDateTime] = new Writes[LocalDateTime] {
    def writes(d: LocalDateTime): JsValue = JsString(d.toString(dateTimeFormat))
  }

  implicit val localDateTimeFormat: Format[LocalDateTime] = Format(localDateTimeReader, localDateTimeWriter)

  implicit val dateTimeReader: Reads[DateTime] = Reads[DateTime]((js: JsValue) =>
    js.validate[String].map[DateTime](dtString =>
      DateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  implicit def toJsPathHelper(path: JsPath): JsPathHelper = new JsPathHelper(path)

  val defaultStringMax: Reads[String] = maxLength[String](255)
  val defaultStringMinMax: Reads[String] = minLength[String](1) andKeep defaultStringMax
  val rgbRegex: Reads[String] = pattern("""^#([\da-fA-F]{2})([\da-fA-F]{2})([\da-fA-F]{2})$""".r, "error.invalidRGBPattern")
  val plateRegex: Reads[String] = pattern("""^[\d\a-zA-Z]*$""".r, "error.invalidPlatePattern")
  val minOnlyWordsRegex: Reads[String] = minLength[String](2) keepAnd onlyWordsRegex
  val positiveInt: Reads[Int] = min[Int](1)
  val zeroPositiveInt: Reads[Int] = min[Int](0)
  val zeroPositiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](0)
  val positiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](1)

  def validLocalDatePeriod()(implicit reads: Reads[Period]) =
    Reads[Period](js => reads.reads(js).flatMap { o =>
      if (o.startPeriod isAfter o.endPeriod)
        JsError("error.startPeriodAfterEndPeriod")
      else
        JsSuccess(o)
    })

  def validLocalTimePeriod()(implicit reads: Reads[DailySchedule]) =
    Reads[DailySchedule](js => reads.reads(js).flatMap { o =>
      if (o.dailyStart isAfter o.dailyEnd)
        JsError("error.dailyStartAfterDailyEnd")
      else
        JsSuccess(o)
    })
}
Then you only need to import this object to have access to all these implicit converters:
package common.forms

import common.json.JsonReaders._
import play.api.libs.json._

/**
 * Form to add a model with only one string field.
 */
object SimpleCatalogAdd {
  case class Data(
    name: String
  )

  implicit val dataReads: Reads[Data] = (__ \ "name").readTrimmedString(defaultStringMinMax).map(Data.apply)
}
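Coming back to the original Map[Timestamp, CaseClassN] question, the repetition can also be reduced with a small reusable helper. A minimal sketch, assuming Timestamp is java.sql.Timestamp, that every mapped class already has its own implicit Format, and that it exposes the timestamp to key by (passed in as getKey); timestampMapFormat and the commented usage are illustrative names, not part of the Play API:
import java.sql.Timestamp
import play.api.libs.json._

def timestampMapFormat[T](getKey: T => Timestamp)(implicit tf: Format[T]): Format[Map[Timestamp, T]] =
  new Format[Map[Timestamp, T]] {
    // Serialise the map as a JSON array of its values, as in the question.
    def writes(m: Map[Timestamp, T]): JsValue = JsArray(m.values.map(Json.toJson(_)).toIndexedSeq)
    // Read the array back and re-key it by the timestamp extracted from each element.
    def reads(jv: JsValue): JsResult[Map[Timestamp, T]] =
      jv.validate[Seq[T]].map(_.map(t => getKey(t) -> t).toMap)
  }

// Usage, mirroring the question's case classes:
// implicit val cc1MapFormat: Format[Map[Timestamp, CaseClass1]] =
//   timestampMapFormat[CaseClass1](_.dataDate.get)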

How to parse JSON with arbitrary schema, update/create one field and write it back as JSON (Scala)

How does one update/create a field in a JSON object with an arbitrary schema and write it back as JSON in Scala?
I tried spray-json with something like this:
import spray.json._
import DefaultJsonProtocol._
val jsonAst = """{"anyfield":"1234", "sought_optional_field":5.0}""".parse
val newValue = jsonAst.asJsObject.fields.getOrElse("sought_optional_field", 1)
val newMap = jsonAst.asJsObject.fields + ("sought_optional_field" -> newValue)
JSONObject(newMap).toJson
but it gives a weird result: "{"anyfield"[ : "1234", "sought_optional_field" : ]1}
You were almost there:
import spray.json._
import DefaultJsonProtocol._
def changeField(json: String) = {
  val jsonAst = JsonParser(json)
  val map = jsonAst.asJsObject.fields
  val sought = map.getOrElse("sought_optional_field", 1.toJson)
  map.updated("sought_optional_field", sought).toJson
}
val jsonA = """{"anyfield":"1234", "sought_optional_field":5.0}"""
val jsonB = """{"anyfield":"1234"}"""
changeField(jsonA)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":5.0}
changeField(jsonB)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":1}
Using Argonaut:
import argonaut._, Argonaut._
def changeField2(json: String) =
  json.parseOption.map( parsed =>
    parsed.withObject(o =>
      o + ("sought_optional_field", o("sought_optional_field").getOrElse(jNumber(1)))
    )
  )
changeField2(jsonA).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":5})
changeField2(jsonB).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":1})