Suppose I have the following structure that I want to serialize to JSON:
case class A(name:String)
case class B(age:Int)
case class C(id:String, a:A,b:B)
I'm using lift-json's write(...), but I want to flatten the structure, so instead of:
{ id:xx , a:{ name:"xxxx" }, b:{ age:xxxx } }
I want to get:
{ id:xx , name:"xxxx" , age:xxxx }
Use the transform method on JValue:
import net.liftweb.json._
import net.liftweb.json.JsonAST._
implicit val formats = net.liftweb.json.DefaultFormats
val c1 = C("c1", A("some-name"), B(42))
val c1flat = Extraction decompose c1 transform { case JField(x, JObject(List(jf))) if x == "a" || x == "b" => jf }
val c1str = Printer pretty (JsonAST render c1flat)
Result:
c1str: String =
{
"id":"c1",
"name":"some-name",
"age":42
}
If A and B have multiple fields, you will want a slightly different approach:
import net.liftweb.json._
import net.liftweb.json.JsonAST._
import net.liftweb.json.JsonDSL._
implicit val formats = net.liftweb.json.DefaultFormats
implicit def cToJson(c: C): JValue = (("id" -> c.id):JValue) merge (Extraction decompose c.a) merge (Extraction decompose c.b)
val c1 = C("c1", A("a name", "a nick", "an alias"), B(11, 111, 1111))
Printer pretty (JsonAST render c1)
res0: String =
{
"id":"c1",
"name":"a name",
"nick":"a nick",
"alias":"an alias",
"age":11,
"weight":111,
"height":1111
}
You can declare a new case class D with fields (id, name, age), populate it from the values you want in its constructor, and then serialize that class to JSON. There may be another way, but this will work.
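A minimal sketch of that approach (the name D and the fromC helper are just illustrative):

import net.liftweb.json._
import net.liftweb.json.Serialization.write

case class D(id: String, name: String, age: Int)

object D {
  // Copy the fields of C into the flat shape we actually want to serialize.
  def fromC(c: C): D = D(c.id, c.a.name, c.b.age)
}

implicit val formats = DefaultFormats

val flat = write(D.fromC(C("c1", A("some-name"), B(42))))
// flat: String = {"id":"c1","name":"some-name","age":42}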
I have a nested JSON file that I am reading as a Spark DataFrame, and I want to replace certain values in it using my own transformation.
For now, let's assume it looks as follows:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json: String, schema: StructType = null): DataFrame = {
  // SparkSessions are available with Spark 2.0+
  val reader = spark.read
  Option(schema).foreach(reader.schema)
  reader.json(sc.parallelize(Array(json)))
}
val df = jsonToDataFrame("""
{
  "A": {
    "B": "b",
    "C": "c",
    "D": { "E": "e" }
  }
}
""")
display(df)
df.printSchema()
Suppose the following transformation (turning lower-case into upper-case) shall be applied to certain values in the above Spark DataFrame:
import org.apache.spark.sql.functions.udf
val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)
While this doesn't work at all:
df.withColumn("A.B", upperUDF('A.B)).show()
the following works:
val df1 = df.select("A.B")
df1.withColumn("B", upperUDF('B)).show()
But in the end I want to keep my nested structure and just replace certain values according to my transformation.
How can one achieve that? How can one preserve the schema when using withColumn?
Finally, I found this thread, which answers my question. The trick is to dynamically preserve the schema while transforming the columns. Using the mutate() function defined therein, the following works well for me (a rough sketch of such a helper follows the snippet):
val df2 = mutate(df, c => if (c.toString == "A.B") upperUDF(c) else c)
val df3 = mutate(df, c => if (c.toString == "A.D.E") upperUDF(c) else c)
display(df2)
df2.printSchema
display(df3)
df3.printSchema
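The mutate() helper itself is defined in the linked thread; roughly, it rebuilds every nested struct column recursively so that the original schema survives the transformation. Below is a hedged sketch of that idea as a hypothetical helper named mutateByPath (not the exact code from the thread): it hands each leaf column to the caller together with its dotted path, which avoids relying on Column.toString.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Rebuild all (possibly nested) columns, letting `fn` rewrite each leaf.
def mutateByPath(df: DataFrame, fn: (Column, String) => Column): DataFrame = {
  def rebuild(schema: StructType, prefix: String): Seq[Column] =
    schema.fields.map { field =>
      val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
      field.dataType match {
        case st: StructType => struct(rebuild(st, path): _*).as(field.name)
        case _              => fn(col(path), path).as(field.name)
      }
    }
  df.select(rebuild(df.schema, ""): _*)
}

// Usage analogous to df2 above:
// val df2 = mutateByPath(df, (c, path) => if (path == "A.B") upperUDF(c) else c)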
How can I write a circe decoder for the class
case class KeyValueRow(count: Int, key: String)
where the JSON contains the field "count" (Int) and some extra string field whose name may vary ("url", "city", whatever)?
{"count":974989,"url":"http://google.com"}
{"count":1234,"city":"Rome"}
You can do what you need like this:
import io.circe.syntax._
import io.circe.parser._
import io.circe.generic.semiauto._
import io.circe.{ Decoder, Encoder, HCursor, Json, DecodingFailure}
object stuff {
  case class KeyValueRow(count: Int, key: String)

  implicit def jsonEncoder: Encoder[KeyValueRow] = deriveEncoder

  implicit def jsonDecoder: Decoder[KeyValueRow] = Decoder.instance { h =>
    (for {
      keys <- h.keys
      key  <- keys.dropWhile(_ == "count").headOption
    } yield {
      for {
        count    <- h.get[Int]("count")
        keyValue <- h.get[String](key)
      } yield KeyValueRow(count.toInt, keyValue)
    }).getOrElse(Left(DecodingFailure("Not a valid KeyValueRow", List())))
  }
}
import stuff._
val a = KeyValueRow(974989, "www.google.com")
println(a.asJson.spaces2)
val test1 = """{"count":974989,"url":"http://google.com"}"""
val test2 = """{"count":1234,"city":"Rome", "will be dropped": "who cares"}"""
val parsedTest1 = parse(test1).flatMap(_.as[KeyValueRow])
val parsedTest2 = parse(test2).flatMap(_.as[KeyValueRow])
println(parsedTest1)
println(parsedTest2)
println(parsedTest1.map(_.asJson.spaces2))
println(parsedTest2.map(_.asJson.spaces2))
As I mentioned in the comment above, keep in mind that if you decode some JSON and then re-encode it, the result will be different from the initial input. To fix that, you would need to keep track of the original name of the key field.
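A minimal sketch of that idea, using a hypothetical KeyValueRow2 that also stores the key's original name (all names here are illustrative):

import io.circe.{Decoder, DecodingFailure, Encoder, Json}

case class KeyValueRow2(count: Int, keyName: String, keyValue: String)

implicit val kvr2Decoder: Decoder[KeyValueRow2] = Decoder.instance { h =>
  (for {
    keys <- h.keys
    key  <- keys.filterNot(_ == "count").headOption
  } yield {
    for {
      count <- h.get[Int]("count")
      value <- h.get[String](key)
    } yield KeyValueRow2(count, key, value)
  }).getOrElse(Left(DecodingFailure("Not a valid KeyValueRow", List())))
}

// Re-encode under the original field name, so decode followed by encode round-trips.
implicit val kvr2Encoder: Encoder[KeyValueRow2] = Encoder.instance { row =>
  Json.obj("count" -> Json.fromInt(row.count), row.keyName -> Json.fromString(row.keyValue))
}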
I would like to remove a top-level field called "id" from a JSON structure, without removing all fields named "id", which is what happens when I run the following code:
scala> import org.json4s._
import org.json4s._
scala> import org.json4s.native.JsonMethods._
import org.json4s.native.JsonMethods._
scala> import org.json4s.JsonDSL._
import org.json4s.JsonDSL._
scala> val json = parse("""{ "id": "bep", "foo": { "id" : "bap" } }""")
json: org.json4s.JValue = JObject(List((id,JString(bep)), (foo,JObject(List((id,JString(bap)))))))
scala> json removeField {
| case ("id", v) => true
| case _ => false
| }
res0: org.json4s.JValue = JObject(List((foo,JObject(List()))))
Any idea how I can avoid removing the inner "id" field?
Edit: unfortunately, I am not able to list all the possible top-level objects that the JSON contains or can contain.
It seems like removeField applies to the entire JSON tree.
This works only for the top-level:
val updated = json match {
  case JObject(l) => JObject(l.filter {
    case (name, _) => name != "id"
  })
}
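Packaged as a small reusable helper (the name removeTopLevelField is just illustrative), the same idea looks like this; non-object values are returned unchanged:

import org.json4s._

// Drop a field only at the top level, leaving nested occurrences intact.
def removeTopLevelField(json: JValue, name: String): JValue = json match {
  case JObject(fields) => JObject(fields.filterNot(_._1 == name))
  case other           => other
}

// removeTopLevelField(json, "id")
// => JObject(List((foo, JObject(List((id, JString(bap)))))))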
Based on the answer here you could do something like this:
val transformedJson2 = json transform {
case JField("id", _) => JNothing
case JField("foo", fields) => fields
}
It is definitely not ideal, because you would have to spell out every element (like "foo") that has a sub-element named "id".
To do this without the need to know the schema of other fields:
JObject(
  json.asInstanceOf[JObject].obj.filterNot(_._1 == "id")
)
JObject.obj is the flat list of fields that make up the object.
This doesn't seem possible staying within the json4s DSL, though.
Is there a simple way to convert a given Row object to JSON?
I found this about converting a whole DataFrame to JSON output:
Spark Row to JSON
But I just want to convert a single Row to JSON.
Here is pseudo code for what I am trying to do.
More precisely, I am reading JSON as input into a DataFrame.
I am producing a new output that is mainly based on columns, but with one JSON field for all the info that does not fit into the columns.
My question is: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
  Try({
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    val combinedRow: Row = metadataRow ++ ("score" -> score)
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  })
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator,StringType,true),
  StructField(range,
    StructType(
      StructField(currency_code,StringType,true),
      StructField(maxrate,LongType,true),
      StructField(minrate,LongType,true)),true))
I also tried Artem's suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  val sparkContext = sqlContext.sparkContext
  import sparkContext._
  import sqlContext.implicits._
  import sqlContext._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataFrame = rowRDD.toDF() // XXX does not compile
  dataFrame
}
You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read JSON input and produce JSON output.
Most fields are handled individually, but a few JSON sub-objects just need to be preserved.
When Spark reads a DataFrame it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out to JSON.
But I need to take some sub-JSON structures out to a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub-JSON object in the incoming JSON-based DataFrame; address_json is the column name of that object converted to a string version of the JSON.
to_json is available as of Spark 2.1.
If you then generate the output JSON using json4s, address_json should be parsed back into an AST representation; otherwise the output JSON will have the address_json part escaped.
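A minimal sketch of that last step, assuming the stringified address has already been pulled out of the Row (the variable names and the sample address are illustrative):

import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.native.JsonMethods._

// The stringified column value, e.g. what to_json produced above.
val addressJson: String = """{"street":"Main St","city":"Rome"}"""

// Parse it back into an AST so it is embedded as an object, not as an escaped string.
val addressAst: JValue = parse(addressJson)
val outgoing: JValue = ("name" -> "some venue") ~ ("address" -> addressAst)

compact(render(outgoing))
// => {"name":"some venue","address":{"street":"Main St","city":"Rome"}}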
Pay attention: the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.
@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSON has a schema, but a Row doesn't carry one on its own, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // createDataFrame expects an RDD of rows, so wrap the single row before applying the schema
  val df = sqlContext.createDataFrame(sc.parallelize(Seq(row)), schema)
  df.toJSON.first()
}
Essentially, you can have a DataFrame which contains just one row. Thus, you can filter your initial DataFrame and then parse it to JSON.
I had the same issue: I had Parquet files with a canonical schema (no arrays), and I only wanted to get JSON events. I did it as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}

def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch { case e: Exception => field.name -> null.asInstanceOf[T] }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// Since the map values are typed as Any, the regular ValueFormatter does not handle
// nested maps, so a case for Map[String, Any] is added here.
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String              => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject         => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray          => ja.toString(defaultFormatter)
  case other                  => other.toString
}

val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a DataFrame, you can directly convert it to a new DataFrame with a JSON object inside and iterate over that:
val df_json = df.toJSON
I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error I came up with this solution, which I tested for nested structures:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

def row2Json(row: Row, sqlContext: SQLContext): String = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.
I want to have different field names in my case classes and in my JSON, therefore I need a convenient way of renaming them in both encoding and decoding.
Does someone have a good solution?
You can use custom key mappings via annotations. The most generic way is the JsonKey annotation from io.circe.generic.extras._. Example from the docs:
import io.circe.generic.extras._, io.circe.syntax._
implicit val config: Configuration = Configuration.default
@ConfiguredJsonCodec case class Bar(@JsonKey("my-int") i: Int, s: String)
Bar(13, "Qux").asJson
// res5: io.circe.Json = JObject(object[my-int -> 13,s -> "Qux"])
This requires the package circe-generic-extras.
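For example, in sbt (the version placeholder is intentional; use the circe version matching the rest of your build):

libraryDependencies += "io.circe" %% "circe-generic-extras" % "<circe-version>"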
Here's a code sample for a Decoder (a bit verbose, since it won't remove the old field):
val pimpedDecoder = deriveDecoder[PimpClass].prepare {
  _.withFocus {
    _.mapObject { x =>
      val value = x("old-field")
      value.map(x.add("new-field", _)).getOrElse(x)
    }
  }
}
implicit val decodeFieldType: Decoder[FieldType] =
  Decoder.forProduct5("nth", "isVLEncoded", "isSerialized", "isSigningField", "type")(FieldType.apply)
This is a simple way if you have lots of different field names.
https://circe.github.io/circe/codecs/custom-codecs.html
You can use the mapJson function on Encoder to derive an encoder from the generic one and remap your field name.
And you can use the prepare function on Decoder to transform the JSON passed to a generic Decoder.
You could also write both from scratch, but that may be a ton of boilerplate; those solutions should each be a handful of lines at most.
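A hedged sketch of that mapJson/prepare combination for a single renamed field (the User case class and the user_name key are purely illustrative):

import io.circe.{Decoder, Encoder}
import io.circe.generic.semiauto._

case class User(userName: String)

// Encoder: derive it, then rename the emitted field on the resulting JSON.
implicit val userEncoder: Encoder[User] =
  deriveEncoder[User].mapJson(_.mapObject { obj =>
    obj("userName").map(v => obj.remove("userName").add("user_name", v)).getOrElse(obj)
  })

// Decoder: copy the incoming field to the name the derived decoder expects.
implicit val userDecoder: Decoder[User] =
  deriveDecoder[User].prepare(_.withFocus(_.mapObject { obj =>
    obj("user_name").map(v => obj.add("userName", v)).getOrElse(obj)
  }))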
The following function can be used to rename a field in circe's JSON:
import io.circe._
object CirceUtil {
  def renameField(json: Json, fieldToRename: String, newName: String): Json =
    (for {
      value   <- json.hcursor.downField(fieldToRename).focus
      newJson <- json.mapObject(_.add(newName, value)).hcursor.downField(fieldToRename).delete.top
    } yield newJson).getOrElse(json)
}
You can use it in an Encoder like so:
implicit val circeEncoder: Encoder[YourCaseClass] = deriveEncoder[YourCaseClass].mapJson(
  CirceUtil.renameField(_, "old_field_name", "new_field_name")
)
Extra
Unit tests
import io.circe.parser._
import org.specs2.mutable.Specification
class CirceUtilSpec extends Specification {

  "CirceUtil" should {
    "renameField" should {
      "correctly rename field" in {
        val json = parse("""{ "oldFieldName": 1 }""").toOption.get
        val resultJson = CirceUtil.renameField(json, "oldFieldName", "newFieldName")
        resultJson.hcursor.downField("oldFieldName").focus must beNone
        resultJson.hcursor.downField("newFieldName").focus must beSome
      }

      "return unchanged json if field is not found" in {
        val json = parse("""{ "oldFieldName": 1 }""").toOption.get
        val resultJson = CirceUtil.renameField(json, "nonExistentField", "newFieldName")
        resultJson must be equalTo json
      }
    }
  }
}