Remove a key/value pair from a JSON object in Scala

import scala.util.parsing.json._
val jsonObj = JSON.parseFull("{\"type\":\"record\",\"name\":\"ProductWithLatestPrice\",\"namespace\":\"models\",\"fields\":[{\"name\":\"isbn\",\"type\":[\"null\",{\"type\":\"string\",\"avro.java.string\":\"String\"}],\"default\":null},{\"name\":\"ku\",\"type\":[\"null\",{\"type\":\"string\",\"avro.java.string\":\"String\"}],\"default\":null},{\"name\":\"pc\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},\"default\":[]},{\"name\":\"mpn\",\"type\":{\"type\":\"array\",\"items\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},\"default\":[]},{\"name\":\"smallDescription\",\"type\":[\"null\",{\"type\":\"string\",\"avro.java.string\":\"String\"}],\"default\":null},{\"name\":\"longDescription\",\"type\":[\"null\",{\"type\":\"string\",\"avro.java.string\":\"String\"}],\"default\":null},{\"name\":\"specificationText\",\"type\":[\"null\",{\"type\":\"string\",\"avro.java.string\":\"String\"}],\"default\":null}]}")
I want to remove the key "smallDescription" and its value from this JSON without using regex. Any help on this?

This should work (updated to match nested array/object structures):
// Recursively walk the parsed structure, dropping `key` from every map it contains.
def remove(key: String)(x: Any): Any =
  x match {
    case m: Map[String, _] => m.mapValues(remove(key)) - key
    case l: List[_]        => l.map(remove(key))
    case v                 => v
  }
val jsonObj = JSON.parseFull("…").map(remove("smallDescription"))
I would recommend using a JSON library though, like http://json4s.org/ or http://argonaut.io/.
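For example, with json4s this becomes a one-liner over the parsed tree (a sketch, assuming the json4s-native artifact is on the classpath; the small jsonStr value is an illustrative stand-in for the schema JSON from the question):
import org.json4s._
import org.json4s.native.JsonMethods._

// Illustrative stand-in for the JSON string from the question.
val jsonStr = """{"name":"ProductWithLatestPrice","smallDescription":"x","fields":[{"name":"isbn","smallDescription":"y"}]}"""

// removeField walks the whole tree, so nested occurrences are dropped too.
val cleaned: String = compact(render(
  parse(jsonStr) removeField {
    case ("smallDescription", _) => true
    case _                       => false
  }
))
// cleaned == {"name":"ProductWithLatestPrice","fields":[{"name":"isbn"}]}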

Get value by key from json array

I have a few json arrays
[{"key":"country","value":"aaa"},{"key":"region","value":"a"},{"key":"city","value":"a1"}]
[{"key":"city","value":"b"},{"key":"street","value":"1"}]
I need to extract city and street value into different columns.
Using get_json_object($"address", "$[2].value").as("city") to get an element by its index doesn't work because the arrays can be missing some fields.
Instead I decided to turn this array into a map of key -> value pairs, but have trouble doing it. So far I only managed to get an array of arrays.
val schema = ArrayType(StructType(Array(
  StructField("key", StringType),
  StructField("value", StringType)
)))
from_json($"address", schema)
Returns
[[country, aaa],[region, a],[city, a1]]
[[city, b],[street, 1]]
I'm not sure where to go from here.
val schema = ArrayType(MapType(StringType, StringType))
Fails with
cannot resolve 'jsontostructs(`address`)' due to data type mismatch: Input schema array<map<string,string>> must be a struct or an array of structs.;;
I'm using spark 2.2
Using a UDF we can handle this easily. In the code below I build a map column with a UDF; I hope this meets the need.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val df1 = spark.read.format("text").load("file path")

val schema = ArrayType(StructType(Array(
  StructField("key", StringType),
  StructField("value", StringType)
)))

// Collapse the array of {key, value} structs into a single Map[String, String].
val arrayToMap = udf[Map[String, String], Seq[Row]] {
  array => array.map { case Row(key: String, value: String) => (key, value) }.toMap
}

val dfJSON = df1.withColumn("jsonData", from_json(col("value"), schema))
  .select("jsonData")
  .withColumn("address", arrayToMap(col("jsonData")))
  .withColumn("city", when(col("address.city").isNotNull, col("address.city")).otherwise(lit("")))
  .withColumn("street", when(col("address.street").isNotNull, col("address.street")).otherwise(lit("")))

dfJSON.printSchema()
dfJSON.show(false)
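As a side note: on Spark 2.4+ (not the 2.2 mentioned in the question) the built-in map_from_entries function does the same conversion without a UDF. A sketch, reusing df1 and schema from above (the dfMap name is just illustrative):
// Spark 2.4+ only: build the map directly from the array of {key, value} structs.
val dfMap = df1
  .withColumn("jsonData", from_json(col("value"), schema))
  .withColumn("address", map_from_entries(col("jsonData")))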

Using Scala to represent two JSON fields of which only one can be null

Let's say my API returns a JSON that looks like this:
{
  "field1": "hey",
  "field2": null
}
I have this rule that only one of these fields can be null at the same time. In this example, only field2 is null so we're ok.
I can represent this in Scala with the following case class:
case class MyFields(
  field1: Option[String],
  field2: Option[String]
)
And by implementing some implicits, we let circe do its magic of converting the objects to JSON.
object MyFields {
  implicit lazy val encoder: Encoder[MyFields] = deriveEncoder[MyFields]
  implicit lazy val decoder: Decoder[MyFields] = deriveDecoder[MyFields]
}
Now, this strategy works. Kinda.
MyFields(Some("hey"), None)
MyFields(None, Some("hey"))
MyFields(Some("hey"), Some("hey"))
These all lead to JSONs that follow the rule. But it's also possible to do:
MyFields(None, None)
Which will lead to a JSON that breaks the rule.
So this strategy doesn't express the rule adequately. What's a better way to do it?
Represent your data as having a member val field1or2: Either[String, String]. If the circe built-in Codec.codecForEither doesn't meet your exact requirements, you can write the codec manually. From what you describe (field1 and field2 must both be present in the JSON, one as a string, one as null), something like:
import io.circe.{Decoder, Encoder, HCursor, Json, DecodingFailure}
case class Fields(fields: Either[String, String])
implicit val encodeFields: Encoder[Fields] = new Encoder[Fields] {
  final def apply(a: Fields): Json = a.fields match {
    case Left(str) => Json.obj(
      "field1" -> Json.fromString(str),
      "field2" -> Json.Null
    )
    case Right(str) => Json.obj(
      "field1" -> Json.Null,
      "field2" -> Json.fromString(str)
    )
  }
}
implicit val decodeFields: Decoder[Fields] = new Decoder[Fields] {
  final def apply(c: HCursor): Decoder.Result[Fields] = {
    val f1 = c.downField("field1").as[Option[String]]
    val f2 = c.downField("field2").as[Option[String]]
    (f1, f2) match {
      case (Right(None), Right(Some(v2))) => Right(Fields(Right(v2)))
      case (Right(Some(v1)), Right(None)) => Right(Fields(Left(v1)))
      case (Left(failure), _) => Left(failure)
      case (_, Left(failure)) => Left(failure)
      case (Right(None), Right(None)) => Left(DecodingFailure("Either field1 or field2 must be non-null", Nil))
      case (Right(Some(_)), Right(Some(_))) => Left(DecodingFailure("field1 and field2 may not both be non-null", Nil))
    }
  }
}
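A quick round-trip check of these codecs (a sketch; the extra imports bring asJson and parse into scope):
import io.circe.parser._
import io.circe.syntax._

println(Fields(Left("hey")).asJson.noSpaces)                                // {"field1":"hey","field2":null}
println(parse("""{"field1":null,"field2":"hey"}""").flatMap(_.as[Fields]))  // Right(Fields(Right(hey)))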
This is an example of application-level verification of data, along with range-checking values and other consistency checks. As such, it doesn't really belong with the raw JSON parsing, but needs to be done in a separate validation step.
So I recommend having a set of classes that directly represent the JSON data, and then a separate set of application classes that use application data types rather than JSON data types.
When reading data, the JSON is read into the data classes to check that the underlying JSON is valid. Then those data classes are converted to application classes with appropriate validation and conversion (checking value ranges, changing strings to enumerations etc.)
This allows the data format to be changed (e.g. to XML or database) without affecting the application logic.
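A minimal sketch of that separation, reusing MyFields from the question as the raw JSON layer (the ValidFields type and validate helper are illustrative names, not from the original answer):
// Application-layer type: it can only hold the combinations the rule allows.
sealed trait ValidFields
case class OnlyField1(value: String)          extends ValidFields
case class OnlyField2(value: String)          extends ValidFields
case class BothFields(f1: String, f2: String) extends ValidFields

// Validation step at the boundary: raw JSON class in, application class (or an error) out.
def validate(raw: MyFields): Either[String, ValidFields] =
  (raw.field1, raw.field2) match {
    case (Some(v1), Some(v2)) => Right(BothFields(v1, v2))
    case (Some(v1), None)     => Right(OnlyField1(v1))
    case (None, Some(v2))     => Right(OnlyField2(v2))
    case (None, None)         => Left("field1 and field2 must not both be null")
  }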
(This is based on Martijn's answer and comment.)
The Cats Ior datatype could be used, as follows:
import cats.data.Ior
import io.circe.parser._
import io.circe.syntax._
import io.circe._
case class Fields(fields: Ior[String, String])
implicit val encodeFields: Encoder[Fields] = (a: Fields) =>
  a.fields match {
    case Ior.Both(v1, v2) => Json.obj(
      ("field1", Json.fromString(v1)),
      ("field2", Json.fromString(v2))
    )
    case Ior.Left(v) => Json.obj(
      ("field1", Json.fromString(v)),
      ("field2", Json.Null)
    )
    case Ior.Right(v) => Json.obj(
      ("field1", Json.Null),
      ("field2", Json.fromString(v))
    )
  }

implicit val decodeFields: Decoder[Fields] = (c: HCursor) => {
  val f1 = c.downField("field1").as[Option[String]]
  val f2 = c.downField("field2").as[Option[String]]
  (f1, f2) match {
    case (Right(Some(v1)), Right(Some(v2))) => Right(Fields(Ior.Both(v1, v2)))
    case (Right(Some(v1)), Right(None)) => Right(Fields(Ior.Left(v1)))
    case (Right(None), Right(Some(v2))) => Right(Fields(Ior.Right(v2)))
    case (Left(failure), _) => Left(failure)
    case (_, Left(failure)) => Left(failure)
    case (Right(None), Right(None)) => Left(DecodingFailure("At least one of field1 or field2 must be non-null", Nil))
  }
}
println(Fields(Ior.Right("right")).asJson)
println(Fields(Ior.Left("left")).asJson)
println(Fields(Ior.both("right", "left")).asJson)
println(parse("""{"field1": null, "field2": "right"}""").flatMap(_.as[Fields]))

How to convert Row to json in Spark 2 Scala

Is there a simple way of converting a given Row object to JSON?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert one Row to JSON.
Here is pseudo code for what I am trying to do.
More precisely, I am reading JSON as input into a DataFrame.
I am producing a new output that is mainly based on columns, but with one JSON field for all the info that does not fit into the columns.
My question is: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
  Try({
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    val combinedRow: Row = metadataRow ++ ("score" -> score) // pseudo: append the score to the metadata row
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  })
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator,StringType,true),
  StructField(range,
    StructType(
      StructField(currency_code,StringType,true),
      StructField(maxrate,LongType,true),
      StructField(minrate,LongType,true)),true))
Also tried Artem's suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  val sparkContext = sqlContext.sparkContext
  import sparkContext._
  import sqlContext.implicits._
  import sqlContext._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataFrame = rowRDD.toDF() // XXX does not compile
  dataFrame
}
You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}

convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read JSON input and produce JSON output.
Most fields are handled individually, but a few JSON sub-objects just need to be preserved.
When Spark reads a DataFrame it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out to JSON.
But I need to pull some sub-JSON structures out into a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub-JSON object in the incoming JSON-based DataFrame; address_json is the name of the column holding that object converted to a JSON string.
to_json is implemented in Spark 2.1.
If generating the output JSON with json4s, address_json should be parsed into an AST representation first; otherwise the output JSON will have the address_json part escaped.
Note that the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.
#deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
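Since JSONObject is deprecated, one possible replacement for the getValuesMap approach is json4s, which ships with Spark. A sketch (like the original, it only handles flat rows):
import org.apache.spark.sql.Row
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

def convertRowToJSON(row: Row): String = {
  implicit val formats: DefaultFormats.type = DefaultFormats
  // Serialize the column-name -> value map; json4s handles null values, unlike JSONObject.
  Serialization.write(row.getValuesMap[Any](row.schema.fieldNames))
}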
JSON has a schema but a Row doesn't, so you need to apply a schema to the Row and convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // createDataFrame (the non-deprecated form of applySchema) expects an RDD[Row], so wrap the single row
  val rowRDD = sqlContext.sparkContext.makeRDD(row :: Nil)
  sqlContext.createDataFrame(rowRDD, schema).toJSON.first()
}
Essentially, you can have a DataFrame which contains just one row. Thus, you can filter your initial DataFrame and then parse it to JSON.
I had the same issue: I had parquet files with a canonical schema (no arrays), and I only wanted to get JSON events. I did as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}

def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        // recurse into nested structs
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch { case e: Exception => field.name -> null.asInstanceOf[T] }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// Since the map values are typed as Any, the regular ValueFormatter is not enough,
// and the Map[String, Any] case below had to be added.
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray => ja.toString(defaultFormatter)
  case other => other.toString
}

val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a DataFrame, you can directly convert it to a Dataset of JSON strings and iterate over that:
val df_json = df.toJSON
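For example (a minimal sketch):
df_json.collect().foreach(println) // each row becomes a single JSON string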
I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error I came up with this solution, which I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

Scala argonaut encode a jEmptyObject to 'false' rather than 'null'

I'm dealing with some code outside of my immediate control, where I need to encode an Option[Thing]: the Some case is encoded as normal, but the None case must produce 'false' rather than null. Can this be accomplished easily? I'm looking at the docs but not having much success.
My code looks like this:
case class Thing(name: String)
case class BiggerThing(stuff: String, thing: Option[Thing])
implicit val ThingEncodeJson: EncodeJson[Thing] =
  EncodeJson(t => ("name" := t.name) ->: jEmptyObject)
and the equivalent for BiggerThing, and the json needs to look like:
For a Some:
"thing":{"name": "bob"}
For a None:
"thing": false
but at present the None case gives:
"thing":null
How do I get it to return false? Could someone point me in the right direction please?
Cheers
You just need a custom CodecJson instance for Option[Thing]:
object Example {
  import argonaut._, Argonaut._

  case class Thing(name: String)
  case class BiggerThing(stuff: String, thing: Option[Thing])

  implicit val encodeThingOption: CodecJson[Option[Thing]] =
    CodecJson(
      (thing: Option[Thing]) => thing.map(_.asJson).getOrElse(jFalse),
      json =>
        // Adopt the easy approach when parsing, that is, if there's no
        // `name` property, assume it was `false` and map it to a `None`.
        json.get[Thing]("name").map(Some(_)) ||| DecodeResult.ok(None)
    )

  implicit val encodeThing: CodecJson[Thing] =
    casecodec1(Thing.apply, Thing.unapply)("name")

  implicit val encodeBiggerThing: CodecJson[BiggerThing] =
    casecodec2(BiggerThing.apply, BiggerThing.unapply)("stuff", "thing")

  def main(args: Array[String]): Unit = {
    val a = BiggerThing("stuff", Some(Thing("name")))
    println(a.asJson.nospaces) // {"stuff":"stuff","thing":{"name":"name"}}

    val b = BiggerThing("stuff", None)
    println(b.asJson.nospaces) // {"stuff":"stuff","thing":false}
  }
}
To encode a BiggerThing without a thing property when thing is None, you need a custom EncodeJson[BiggerThing] instance:
implicit val decodeBiggerThing: DecodeJson[BiggerThing] =
  jdecode2L(BiggerThing.apply)("stuff", "thing")

implicit val encodeBiggerThing: EncodeJson[BiggerThing] =
  EncodeJson { biggerThing =>
    val thing = biggerThing.thing.map(t => Json("thing" := t))
    ("stuff" := biggerThing.stuff) ->: thing.getOrElse(jEmptyObject)
  }

How to transform any JSON string to Map[Symbol, Any] using import play.api.libs.json?

I cannot figure out whether there is a way to transform a JSON fragment (as a String) into a Map[Symbol, Any] using the play.api.libs.json library, where Any could be an Int, a Double, a String or a nested Map[Symbol, Any].
Can anybody give me a hint on how to do this?
JsObject.fieldSet will give you a Set[(String, JsValue)] that you can transform into a Map[Symbol, Any]. You will have to pattern match on all possible subclasses of JsValue and transform each to the type you want.
For example, something like this:
Json.parse(text) match {
  case js: JsObject =>
    js.fieldSet.map {
      case (key, value) => Symbol(key) -> transform(value)
    }.toMap
  case x => throw new RuntimeException(s"Expected object json but got $text")
}

def transform(jsValue: JsValue): Any = jsValue match {
  case JsNumber(value) => value.toDouble
  // ...etc...
}
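A more complete transform covering the types mentioned in the question might look like the sketch below; the JsNull, JsBoolean and JsArray cases are assumptions, since the question only mentions Int, Double, String and nested maps.
import play.api.libs.json._

def transform(jsValue: JsValue): Any = jsValue match {
  case JsNumber(value)  => value.toDouble // Ints and Doubles both arrive as BigDecimal; collapsed to Double here
  case JsString(value)  => value
  case JsBoolean(value) => value
  case JsNull           => null
  case JsArray(items)   => items.map(transform).toList
  case obj: JsObject    => obj.fieldSet.map { case (k, v) => Symbol(k) -> transform(v) }.toMap
}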