I am very new to Scala. I am trying to convert an Iterable[Dataset[Row]] to a DataFrame, but it is not working for me. Here is the code:
def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
  val sparkSession: SparkSession = SparkSession.builder().getOrCreate()
  import sparkSession.implicits._
  val jsonSeq = Seq(input)
  val jsonRDD = sparkSession.sparkContext.parallelize(jsonSeq)
  val jsonDF = jsonRDD.toDF()
}
You can first convert to a Dataset and then convert to a DataFrame:
def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  import spark.implicits._
  val jsonSeq = Seq(input)
  val jsonRDD = spark.sparkContext.parallelize(jsonSeq)
  val jsonDF = spark.createDataset(jsonRDD).toDF()
}
If you don't like converting to a Dataset, you can specify the type of RDD:
val jsonRDD: RDD[Iterable[Dataset[Row]]] = spark.sparkContext.parallelize(jsonSeq)
val jsonDF = spark.createDataFrame[Iterable[Dataset[Row]]](jsonRDD).toDF()
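If the goal is simply one DataFrame containing the rows of every Dataset in the Iterable, a minimal sketch of another option (not from the answer above), assuming all the Datasets share the same schema and the Iterable is non-empty:

def execute(spark: SparkSession,
            input: Iterable[Dataset[Row]],
            execParams: Map[String, String]): Dataset[Row] = {
  // Pairwise union of all Datasets in the Iterable; throws if it is empty
  input.reduce(_ union _)
}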
I want to compare the first JSON string with the other two JSON strings.
First the keys should match. If they match, then compare the nested keys and values.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"
This should throw an error because of the key mismatch.
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyA\":{\"1\":11,\"0\":200}}"
This should return true, as the keys match and sub-keys 1 and 0 of the first JSON have larger values than the corresponding sub-keys of the second and third JSON. The numbers are Long values.
Please help.
Below is my attempt.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
val of1 = "{\"keyA\":{\"1\":13,\"0\":202}}"
val of2 = "{\"keyA\":{\"1\":12,\"0\":201}}"
val of3 = "{\"keyB\":{\"1\":12}}"
def OffsetComparator(json1: String, json2: String, json3: String): Boolean = {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  val jsonObj1 = mapper.readValue(json1, classOf[Map[String, Map[String, Long]]])
  val jsonObj2 = mapper.readValue(json2, classOf[Map[String, Map[String, Long]]])
  val jsonObj3 = mapper.readValue(json3, classOf[Map[String, Map[String, Long]]])
  // Trying to get the keys and compare them first
  val mapA = jsonObj1.keySet.foreach(i => jsonObj1.keySet(i).toString)
  val mapB = jsonObj2.keySet
  val mapC = jsonObj3.keySet
  println(jsonObj1.keySet == jsonObj3.keySet)
  if (mapA.keySet != mapB.keySet || mapA.keySet != mapC.keySet) throw new Exception("partitions mismatch")
  mapA.keys.forall(k => mapA(k).asInstanceOf[Long] > mapB(k).asInstanceOf[Long] && mapA(k).asInstanceOf[Long] > mapC(k).asInstanceOf[Long])
  // Getting error: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long when I cast to Long. Not sure why.
}
println(OffsetComparator(of1, of2, of3))
You can try https://github.com/gnieh/diffson. It's available for circe, spray-json and play-json.
Your example with Circe:
import diffson._
import diffson.lcs._
import diffson.jsonpatch.lcsdiff._
import io.circe._
import diffson.circe._
import diffson.jsonpatch._
import io.circe.parser._
val decoder = Decoder[JsonPatch[Json]]
val encoder = Encoder[JsonPatch[Json]]
implicit val lcs = new Patience[Json]
val json1 = parse(of1)
val json2 = parse(of2)
val patch =
  for {
    json1 <- json1
    json2 <- json2
  } yield diff(json1, json2)
print(patch)
That gives:
Right(JsonPatch(List(Replace(Chain(Left(keyA), Left(0)),201,None), Replace(Chain(Left(keyA), Left(1)),12,None))))
Take a look at https://index.scala-lang.org/gnieh/diffson/diffson-circe/4.0.3?target=_2.13 to see how it works.
For Circe, include the dependency:
"org.gnieh" %% "diffson-circe" % "4.0.3"
I am trying to save a Dataset into an Elasticsearch index every day (scheduled with Oozie), but I sometimes get the error java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.util.JsonProtocol, so the job fails immediately. I don't know why this error appears.
Code:
private def readSource1()(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  val sourceName = "dictionary.source1"
  val plantsPath: String = config.getString("sources." + sourceName + ".path")
  spark.read
    .option("delimiter", ";")
    .option("header", "true")
    .csv(plantsPath)
    .select('id as "sourceId", 'assembly_site_id)
}
private def readSource2()(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  val source2: SourceIO = SourceManager(config)("source2")
  (startDate, endDate) match {
    case (Some(sd), Some(ed)) ⇒
      source2.loadDf()
        .where('assemblyEndDate.between(Date.valueOf(sd), Date.valueOf(ed)) ||
          'tctDate.between(Date.valueOf(sd), Date.valueOf(ed)))
    case _ ⇒ source2.loadDf()
  }
}
def saveSourceToEs(implicit sparkSession: SparkSession): Unit = {
  val source1: DataFrame = readSource1()
  val source2: DataFrame = readSource2()
  val source: Dataset[Source] = buildSource(this.getSource(source1, source2))
  source.saveToEs(s"source_${createDateString()}/_doc")
}
object SourceIndexer extends SparkApp with Configurable with Logging {
  val config: Config = ConfigFactory.load()

  def apply(
    sourceID: Option[String] = None,
    startDate: Option[LocalDate] = None,
    endDate: Option[LocalDate] = None
  ): SourceIndexer = {
    new SourceIndexer(config, sourceID, startDate, endDate)
  }

  def main(args: Array[String]): Unit = {
    try {
      val bootConfig = BootConfig.parseSourceIndexer(args)
      this.apply(bootConfig.sourceID, bootConfig.startDate, bootConfig.endDate)
        .saveSourceToEs(spark)
    } finally {
      spark.sparkContext.stop()
    }
  }
}
Thanks for your help.
I have implemented Spark Structured Streaming, and for my use case I have to specify the starting offsets.
I have the offset values in the form of an Array[String]:
{"topic":"test","partition":0,"starting_offset":123}
{"topic":"test","partition":1,"starting_offset":456}
I would like to convert it programmatically to the format below, so that I can pass it to Spark:
{"test":{"0":123,"1":456}}
Note: This is just a sample, I keep getting different offset ranges so I cannot hardcode it.
If array is the variable containing the list you describe, then:
>>> [{d['topic']: [d['partition'], d['starting_offset']]} for d in array]
[{'test': [0, 123]}, {'test': [1, 456]}]
scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> val topicAsRawStr: Array[String] = Array(
"""{"topic":"test","partition":0,"starting_offset":123}""",
"""{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(rawText => {
val json = parse(rawText)
val topicName = json \ "topic" // Extract topic value
val offsetForTopic = json \ "starting_offset" // Extract starting_offset
topicName -> offsetForTopic
})
scala> // Aggregate offsets for each topic
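The "aggregate" step left as a comment above can also be completed without Spark. A minimal sketch (assuming json4s with DefaultFormats and Serialization.write for rendering) that groups the parsed values by topic and produces exactly the {"topic":{"partition":offset}} string the question asks for:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization

implicit val formats: Formats = DefaultFormats

// Parse every raw line into (topic, partition, starting_offset)
val entries = topicAsRawStr.map { rawText =>
  val json = parse(rawText)
  ((json \ "topic").extract[String],
    (json \ "partition").extract[Int].toString,
    (json \ "starting_offset").extract[Long])
}

// Group by topic: Map("test" -> Map("0" -> 123, "1" -> 456))
val offsetsByTopic: Map[String, Map[String, Long]] =
  entries.groupBy(_._1).map { case (topic, rows) =>
    topic -> rows.map(r => r._2 -> r._3).toMap
  }

// Render the string expected by the startingOffsets option: {"test":{"0":123,"1":456}}
val startingOffsets: String = Serialization.write(offsetsByTopic)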
You can also use the spark.sparkContext.parallelize API.
scala> implicit val formats: Formats = DefaultFormats
scala> // Field names match the JSON keys so json4s can extract them directly
scala> case class KafkaTopic(topic: String, partition: Int, starting_offset: Long)
scala> val spark: SparkSession = ???
scala> import spark.implicits._
scala> val topicAsRawStr: Array[String] = Array(
"""{"topic":"test","partition":0,"starting_offset":123}""",
"""{"topic":"test","partition":1,"starting_offset":456}""")
scala> val topicAsJSONs = topicAsRawStr.map(line => parse(line).extract[KafkaTopic])
scala> val kafkaTopicRDD = spark.sparkContext.parallelize(topicAsJSONs)
scala> val kafkaTopicDS = spark.createDataset(kafkaTopicRDD)
scala> // Aggregate offsets for each topic: (topic, Map(partition -> starting_offset))
scala> val aggregatedOffsetsByTopic = kafkaTopicDS
         .groupByKey(_.topic)
         .mapGroups { case (topicName, kafkaTopics) =>
           topicName -> kafkaTopics.map(kT => kT.partition.toString -> kT.starting_offset).toMap
         }
I have a Scala application with a case class like:
case class SR(
  systemId: Option[String] = None,
  x: Map[Timestamp, CaseClass1] = Map.empty,
  y: Map[Timestamp, CaseClass2] = Map.empty,
  z: Map[Timestamp, CaseClass3] = Map.empty
)
Now I have to provide an implicit JSON read/write format for the properties x, y, z of the SR case class, like:
implicit val mapCMPFormat = new Format[Map[Timestamp, CaseClass1]] {
  def writes(obj: Map[Timestamp, CaseClass1]): JsValue =
    JsArray(obj.values.toSeq.map(Json.toJson(_)))

  def reads(jv: JsValue): JsResult[Map[Timestamp, CaseClass1]] = jv.validate[scala.collection.Seq[CaseClass1]] match {
    case JsSuccess(objs, path) => JsSuccess(objs.map(obj => obj.dataDate.get -> obj).toMap, path)
    case err: JsError => err
  }
}
And similarly for y and z; in the future I will add many more properties like x, y, z to the SR case class and will then need to provide the formatters.
So can I get some generic formatter that will take care of all such types?
To my knowledge, a simple way to achieve this does not exist; however, creating a "default" format for each object should not be hard, something like:
case class VehicleColorForAdd(
  name: String,
  rgb: String
)

object VehicleColorForAdd {
  implicit val jsonFormat: Format[VehicleColorForAdd] = Json.format[VehicleColorForAdd]
}
This way you have access to the implicit simply by using the object, so you can have other objects that contain this object with no problem:
case class BiggerModel(
  vehicleColorForAdd: VehicleColorForAdd
)

object BiggerModel {
  implicit val jsonFormat: Format[BiggerModel] = Json.format[BiggerModel]
}
Sadly, you need to do this for each class type, but you can "extend" the Play converters with your own; for example, these are some of my default readers:
package common.json

import core.order.Order
import org.joda.time.{ DateTime, LocalDateTime }
import org.joda.time.format.DateTimeFormat
import core.promotion.{ DailySchedule, Period }
import play.api.libs.functional.syntax._
import play.api.libs.json.Reads._
import play.api.libs.json._
import play.api.libs.json.{ JsError, JsPath, JsSuccess, Reads }

import scala.language.implicitConversions

/**
 * General JSON readers and transformations.
 */
object JsonReaders {
  val dateTimeFormat = "yyyy-MM-dd HH:mm:ss"

  class JsPathHelper(val path: JsPath) {
    def readTrimmedString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.trim)

    def readUpperString(implicit r: Reads[String]): Reads[String] = Reads.at[String](path)(r).map(_.toUpperCase)

    def readNullableTrimmedString(implicit r: Reads[String]): Reads[Option[String]] = Reads.nullable[String](path)(r).map(_.map(_.trim))
  }

  implicit val localDateTimeReader: Reads[LocalDateTime] = Reads[LocalDateTime]((js: JsValue) =>
    js.validate[String].map[LocalDateTime](dtString =>
      LocalDateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  val localDateTimeWriter: Writes[LocalDateTime] = new Writes[LocalDateTime] {
    def writes(d: LocalDateTime): JsValue = JsString(d.toString(dateTimeFormat))
  }

  implicit val localDateTimeFormat: Format[LocalDateTime] = Format(localDateTimeReader, localDateTimeWriter)

  implicit val dateTimeReader: Reads[DateTime] = Reads[DateTime]((js: JsValue) =>
    js.validate[String].map[DateTime](dtString =>
      DateTime.parse(dtString, DateTimeFormat.forPattern(dateTimeFormat))))

  implicit def toJsPathHelper(path: JsPath): JsPathHelper = new JsPathHelper(path)

  val defaultStringMax: Reads[String] = maxLength[String](255)
  val defaultStringMinMax: Reads[String] = minLength[String](1) andKeep defaultStringMax
  val rgbRegex: Reads[String] = pattern("""^#([\da-fA-F]{2})([\da-fA-F]{2})([\da-fA-F]{2})$""".r, "error.invalidRGBPattern")
  val plateRegex: Reads[String] = pattern("""^[\d\a-zA-Z]*$""".r, "error.invalidPlatePattern")
  val minOnlyWordsRegex: Reads[String] = minLength[String](2) keepAnd onlyWordsRegex
  val positiveInt: Reads[Int] = min[Int](1)
  val zeroPositiveInt: Reads[Int] = min[Int](0)
  val zeroPositiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](0)
  val positiveBigDecimal: Reads[BigDecimal] = min[BigDecimal](1)

  def validLocalDatePeriod()(implicit reads: Reads[Period]) =
    Reads[Period](js => reads.reads(js).flatMap { o =>
      if (o.startPeriod isAfter o.endPeriod)
        JsError("error.startPeriodAfterEndPeriod")
      else
        JsSuccess(o)
    })

  def validLocalTimePeriod()(implicit reads: Reads[DailySchedule]) =
    Reads[DailySchedule](js => reads.reads(js).flatMap { o =>
      if (o.dailyStart isAfter o.dailyEnd)
        JsError("error.dailyStartAfterDailyEnd")
      else
        JsSuccess(o)
    })
}
Then you only need to import this object to have access to all these implicit converters:
package common.forms
import common.json.JsonReaders._
import play.api.libs.json._
/**
* Form to add a model with only one string field.
*/
object SimpleCatalogAdd {
  case class Data(
    name: String
  )

  implicit val dataReads: Reads[Data] = (__ \ "name").readTrimmedString(defaultStringMinMax).map(Data.apply)
}
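As a footnote to the question itself: the per-type Map[Timestamp, ...] formats can also be generalized into a single implicit def. A minimal sketch, assuming each of CaseClass1/2/3 exposes the dataDate: Option[Timestamp] field that the question's format already uses as the map key; the trait name HasDataDate is made up here purely for illustration.

import java.sql.Timestamp
import play.api.libs.json._

// Hypothetical marker trait: each case class must expose the field used as the map key
trait HasDataDate { def dataDate: Option[Timestamp] }

implicit def timestampMapFormat[T <: HasDataDate](implicit tFormat: Format[T]): Format[Map[Timestamp, T]] =
  new Format[Map[Timestamp, T]] {
    def writes(obj: Map[Timestamp, T]): JsValue =
      JsArray(obj.values.toSeq.map(Json.toJson(_)))

    def reads(jv: JsValue): JsResult[Map[Timestamp, T]] = jv.validate[Seq[T]] match {
      case JsSuccess(objs, path) => JsSuccess(objs.map(obj => obj.dataDate.get -> obj).toMap, path)
      case err: JsError          => err
    }
  }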
How does one update/create a field in a JSON object with an arbitrary schema and write it back as JSON in Scala?
I tried spray-json with something like this:
import spray.json._
import DefaultJsonProtocol._
val jsonAst = """{"anyfield":"1234", "sought_optional_field":5.0}""".parse
val newValue = jsonAst.asJsObject.fields.getOrElse("sought_optional_field", 1)
val newMap = jsonAst.asJsObject.fields + ("sought_optional_field" -> newValue)
JSONObject(newMap).toJson
but it gives a weird result: "{"anyfield"[ : "1234", "sought_optional_field" : ]1}
You were almost there:
import spray.json._
import DefaultJsonProtocol._
def changeField(json: String) = {
  val jsonAst = JsonParser(json)
  val map = jsonAst.asJsObject.fields
  val sought = map.getOrElse("sought_optional_field", 1.toJson)
  map.updated("sought_optional_field", sought).toJson
}
val jsonA = """{"anyfield":"1234", "sought_optional_field":5.0}"""
val jsonB = """{"anyfield":"1234"}"""
changeField(jsonA)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":5.0}
changeField(jsonB)
// spray.json.JsValue = {"anyfield":"1234","sought_optional_field":1}
Using Argonaut:
import argonaut._, Argonaut._
def changeField2(json: String) =
  json.parseOption.map(parsed =>
    parsed.withObject(o =>
      o + ("sought_optional_field", o("sought_optional_field").getOrElse(jNumber(1)))
    )
  )
changeField2(jsonA).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":5})
changeField2(jsonB).map(_.nospaces)
// Option[String] = Some({"anyfield":"1234","sought_optional_field":1})