Nested JSON: extract a value when a key in the middle is unknown

I have a JSON column (colJson) in a DataFrame like this:
{
"a": "value1",
"b": "value1",
"c": true,
"details": {
"qgiejfkfk123": { //unknown value
"model1": {
"score": 0.531,
"version": "v1"
},
"model2": {
"score": 0.840,
"version": "v2"
},
"other_details": {
"decision": false,
"version": "v1"
}
}
}
}
Here 'qgiejfkfk123' is a dynamic value that changes with each row. However, I need to extract model1.score as well as model2.score.
I tried
sourceDf.withColumn("model1_score",get_json_object(col("colJson"), "$.details.*.model1.score").cast(DoubleType))
.withColumn("model2_score",get_json_object(col("colJson"), "$.details.*.model2.score").cast(DoubleType))
but it did not work.

I managed to solve it by using from_json, parsing the dynamic value as a Map<String, Struct> and exploding the values from it:
val schema = "STRUCT<`details`: MAP<STRING, STRUCT<`model1`: STRUCT<`score`: DOUBLE, `version`: STRING>, `model2`: STRUCT<`score`: DOUBLE, `version`: STRING>, `other_details`: STRUCT<`decision`: BOOLEAN, `version`: STRING>>>>"
val fromJsonDf = sourceDf.withColumn("colJson", from_json(col("colJson"), lit(schema)))
val explodeDf = fromJsonDf.select($"*", explode(col("colJson.details")))
// +----------------------------------------------------------+------------+--------------------------------------+
// |colJson |key |value |
// +----------------------------------------------------------+------------+--------------------------------------+
// |{{qgiejfkfk123 -> {{0.531, v1}, {0.84, v2}, {false, v1}}}}|qgiejfkfk123|{{0.531, v1}, {0.84, v2}, {false, v1}}|
// +----------------------------------------------------------+------------+--------------------------------------+
val finalDf = explodeDf.select(col("value.model1.score").as("model1_score"), col("value.model2.score").as("model2_score"))
// +------------+------------+
// |model1_score|model2_score|
// +------------+------------+
// | 0.531| 0.84|
// +------------+------------+
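The reason the Map<String, Struct> schema works is that it lets you read the values without ever naming the dynamic key. That idea can be mirrored with plain Scala collections (a sketch with made-up data, no Spark required):

```scala
// Model the unknown key as a Map so we never have to reference it by name.
case class Model(score: Double, version: String)

val details: Map[String, Map[String, Model]] = Map(
  "qgiejfkfk123" -> Map( // the dynamic key; its name is irrelevant
    "model1" -> Model(0.531, "v1"),
    "model2" -> Model(0.840, "v2")))

// Take the values of the outer map (what explode does to the Spark MapType
// column), then navigate by the known inner keys.
val inner = details.values.head
val model1Score = inner("model1").score // 0.531
val model2Score = inner("model2").score // 0.84
```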

Related

Kotlinx serialization: ignore unknown enum values when parsing

I have a JSON which looks like this:
[
{
"object": [
{
"enumValue1": "value1",
"value2": 1
}
],
},
{
"object2": [
{
"enumValue1": "value2",
"value2": 2
}
],
},
{
"object3": [
{
"enumValue1": "value1",
"value2": 3
}
],
},
{
"object4": [
{
"enumValue1": "unknown",
"value2": 4
}
],
},
]
I want to parse this JSON with kotlinx.serialization.
I've created a class and an enum:
@Serializable
data class Class(
    @SerialName("enumValue1")
    val enumValue1: EnumValue,
    @SerialName("value2")
    val value2: Int
)
@Serializable
enum class EnumValue {
    @SerialName("value1") VALUE_1,
    @SerialName("value2") VALUE_2
}
I expect the output of the parsing to be a list with 3 objects in it (the object with the value "unknown" not parsed).
How could I achieve it?
I have tried:
ignoreUnknownKeys = true
coerceInputValues = true
But it doesn't work:
Field 'enumValue1' is required for type with serial name, but it was missing
Thanks for your help
You should declare enumValue1 as nullable, with a default value:
val enumValue1: EnumValue? = null
That makes it optional; together with coerceInputValues = true, the unknown enum entry is coerced to the default instead of failing.

Select a particular value from a JSON array

I have a JSON array containing the following details. I would like to extract the text alignment value "right" and assign it to a val.
"data":[
{
"formatType": "text",
"value": "bgyufcie huis huids hufhsduhfsl hd"
},
{
"formatType": "text size",
"value": 12
},
{
"formatType": "text alignment",
"value" : "right"
}
]
Any thoughts?
Using the Gson library, you can map the JSON to a Java object.
So you have to create a class like this:
public class MyObject {
    private String formatType;
    private String value;
    // Constructors, getters and setters...
}
Then, using the fromJson method, you can create an array of MyObject:
Gson gson = new Gson();
MyObject[] array = gson.fromJson(new FileReader("file.json"), MyObject[].class);
You can also use json4s library as shown next:
import org.json4s._
import org.json4s.jackson.JsonMethods._
val json = """{
"data":[
{
"formatType": "text",
"value": "bgyufcie huis huids hufhsduhfsl hd"
},
{
"formatType": "text size",
"value": 12
},
{
"formatType": "text alignment",
"value" : "right"
}
]
}"""
val parsed = parse(json)
val value = (parsed \ "data" \\ classOf[JObject]).filter(m => m("formatType") == "text alignment")(0)("value")
// value: Any = right
The expression (parsed \ "data" \\ classOf[JObject]) extracts all the items into a List of Maps, i.e.:
List(
Map(formatType -> text, value -> bgyufcie huis huids hufhsduhfsl hd),
Map(formatType -> text size, value -> 12), Map(formatType -> text alignment, value -> right)
).
From those, we apply filter(m => m("formatType") == "text alignment") to retrieve the record we actually need.
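The json4s pipeline above boils down to filtering a list of maps by one field. The same shape in plain Scala collections (hypothetical data, no json4s required):

```scala
// Each JObject became a Map; selecting by formatType is an ordinary find.
val data = List(
  Map("formatType" -> "text", "value" -> "bgyufcie huis huids hufhsduhfsl hd"),
  Map("formatType" -> "text size", "value" -> 12),
  Map("formatType" -> "text alignment", "value" -> "right"))

val alignment = data.find(_("formatType") == "text alignment").map(_("value"))
// alignment: Option[Any] = Some(right)
```

Using find returns an Option instead of indexing into a filtered list with (0), so a missing "text alignment" entry yields None rather than an exception.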
Use Dijon FTW!
Here is a test that demonstrates how easily the "right" value can be found in samples like yours:
import com.github.pathikrit.dijon._
val json = parse(
"""{
|"data":[
| {
| "formatType": "text",
| "value": "bgyufcie huis huids hufhsduhfsl hd"
| },
| {
| "formatType": "text size",
| "value": 12
| },
| {
| "formatType": "text alignment",
| "value" : "right"
| }
|]
|}""".stripMargin)
assert(json.data.toSeq.collect {
case obj if obj.formatType == "text alignment" => obj.value
}.head == "right")
I would use the Jackson library; it is very helpful for parsing JSON. You can read the JSON using an ObjectMapper.
Here is a full tutorial to get you started: https://www.mkyong.com/java/jackson-how-to-parse-json/
You can create a multiline JSON string, then parse that string directly into a Scala object, using the net.liftweb package to solve this:
import net.liftweb.json._
object SarahEmailPluginConfigTest {
implicit val formats = DefaultFormats
case class Mailserver(url: String, username: String, password: String)
val json = parse(
"""
{
"url": "imap.yahoo.com",
"username": "myusername",
"password": "mypassword"
}
"""
)
def main(args: Array[String]) {
val m = json.extract[Mailserver]
println(m.url)
println(m.username)
println(m.password)
}
}
Reference: https://alvinalexander.com/scala/simple-scala-lift-json-example-lift-framework

Spark: Splitting JSON strings into separate dataframe columns

I'm loading the JSON string below into a dataframe column.
{
"title": {
"titleid": "222",
"titlename": "ABCD"
},
"customer": {
"customerDetail": {
"customerid": 878378743,
"customerstatus": "ACTIVE",
"customersystems": {
"customersystem1": "SYS01",
"customersystem2": null
},
"sysid": null
},
"persons": [{
"personid": "123",
"personname": "IIISKDJKJSD"
},
{
"personid": "456",
"personname": "IUDFIDIKJK"
}]
}
}
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
//Schema:
StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))
//Output:
+------------------------------+---------------------------------------+
|customerDetail |persons |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
My question: is there a way to split the key values into separate dataframe columns as below, while keeping the array columns as-is, since I need to have only one record per JSON string?
Example for customer column:
customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customerid,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }
Edited post :
val df = spark.read.json("your/path/data.json")

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

val fields = collectFields("", df.schema).map(_.tail)
df.select(fields.map(col): _*).show(false)
Output :
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE |SYS01 |null |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222 |ABCD |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
You can also try this with the help of RDDs: define the column names in an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.

How to insert an empty object into JSON using Circe?

I'm getting a JSON object over the network, as a String. I'm then using Circe to parse it. I want to add a handful of fields to it, and then pass it on downstream.
Almost all of that works.
The problem is that my "adding" is really "overwriting". That's actually ok, as long as I add an empty object first. How can I add such an empty object?
So looking at the code below, I am overwriting "sometimes_empty": {} and it works. But because sometimes_empty is not always empty, it results in some data loss. I'd like to add a field like "custom": {} and then overwrite the value of custom with my existing code.
Two StackOverflow posts were helpful. One worked, but wasn't quite what I was looking for. The other I couldn't get to work.
1: Modifying a JSON array in Scala with circe
2: Adding field to a json using Circe
val js: String = """
{
"id": "19",
"type": "Party",
"field": {
"id": 1482,
"name": "Anne Party",
"url": "https"
},
"sometimes_empty": {
},
"bool": true,
"timestamp": "2018-12-18T11:39:18Z"
}
"""
val newJson = parse(js).toOption
.flatMap { doc =>
doc.hcursor
.downField("sometimes_empty")
.withFocus(_ =>
Json.fromFields(
Seq(
("myUrl", Json.fromString(myUrl)),
("valueZ", Json.fromString(valueZ)),
("valueQ", Json.fromString(valueQ)),
("balloons", Json.fromString(balloons))
)
)
)
.top
}
newJson match {
case Some(v) => return v.toString
case None => println("Failure!")
}
We need to do a couple of things. First, we zoom in on the specific property we want to update; if it doesn't exist, we create a new empty one. Then we turn the zoomed-in property from a Json into a JsonObject so we can modify it with the +: method. Finally, we re-introduce the updated property into the original parsed JSON to get the complete result:
import io.circe.{Json, JsonObject, parser}
import io.circe.syntax._
object JsonTest {
def main(args: Array[String]): Unit = {
val js: String =
"""
|{
| "id": "19",
| "type": "Party",
| "field": {
| "id": 1482,
| "name": "Anne Party",
| "url": "https"
| },
| "bool": true,
| "timestamp": "2018-12-18T11:39:18Z"
|}
""".stripMargin
val maybeAppendedJson =
for {
json <- parser.parse(js).toOption
sometimesEmpty <- json.hcursor
.downField("sometimes_empty")
.focus
.orElse(Option(Json.fromJsonObject(JsonObject.empty)))
jsonObject <- json.asObject
emptyFieldJson <- sometimesEmpty.asObject
appendedField = emptyFieldJson.+:("added", Json.fromBoolean(true))
res = jsonObject.+:("sometimes_empty", appendedField.asJson)
} yield res
maybeAppendedJson.foreach(obj => println(obj.asJson.spaces2))
}
}
Yields:
{
"id" : "19",
"type" : "Party",
"field" : {
"id" : 1482,
"name" : "Anne Party",
"url" : "https"
},
"sometimes_empty" : {
"added" : true,
"someProperty" : true
},
"bool" : true,
"timestamp" : "2018-12-18T11:39:18Z"
}
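The two-step fix above—ensure the key exists with an empty object, then overwrite it—can be mirrored with a plain Scala Map (stdlib only, not circe):

```scala
// Step 1: ensure "sometimes_empty" is present, defaulting to an empty object.
val parsed: Map[String, Any] = Map("id" -> "19", "type" -> "Party")

val ensured = parsed.updatedWith("sometimes_empty") {
  case None  => Some(Map.empty[String, Any]) // absent: add an empty object
  case other => other                        // present: keep the existing value
}

// Step 2: overwrite it, as the original withFocus/+: code does.
val overwritten = ensured.updated("sometimes_empty", Map("added" -> true))
```

updatedWith (Scala 2.13+) plays the role of circe's .orElse(Option(Json.fromJsonObject(JsonObject.empty))): the default is only supplied when the key is missing, so existing data is not clobbered by the ensure step.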

Scala JSON: Collect and parse an array of arrays with lift

I have a JSON that has an array of arrays, like this:
{
"id": "532242",
"text": "Some text here. And Here",
"analysis": {
"exec": "true",
"rowID": "always",
"sentences": {
"next": null,
"data": [{
"sequence": "1",
"readability_score_lexical": null,
"readability_score_syntax": null,
"tokens": [{
"word": "Some",
"lemma": "Some"
},
{
"word": "text",
"lemma": "text"
}
]
},
{
"sequence": "3",
"readability_score_lexical": null,
"readability_score_syntax": null,
"tokens": [{
"word": "and",
"lemma": "And"
},
{
"word": "here",
"lemma": "here"
}
]
}
]
}
}
}
The structure is pretty complicated, but I cannot do anything about it because it is the response from an API.
What I need is to get a list of "tokens" objects.
I did this with lift-json:
case class Token(word:String, lemma:String)
implicit val formats: Formats = net.liftweb.json.DefaultFormats
val jsonObj = net.liftweb.json.parse(json)
val tokens = (jsonObj \\ "tokens").children
for (el <- tokens) {
val m = el.extract[Token]
println(s"Word ${m.word} and ${m.lemma}")
}
but it says:
net.liftweb.json.MappingException: No usable value for word
Do not know how to convert JArray(List(JField(word,JString(Some)), JField(word,JString(text))))
[...]
Caused by: net.liftweb.json.MappingException: Do not know how to convert JArray(List(JField(word,JString(Some)), JField(word,JString(text)))) into class java.lang.String
And I don't understand how could I make it right.
You should get what you expect by replacing:
val tokens = (jsonObj \\ "tokens").children
for (el <- tokens) {
val m = el.extract[Token]
println(s"Word ${m.word} and ${m.lemma}")
}
with:
val tokens = for {
// tokenList are:
// JArray(List(JObject(List(JField(word,JString(Some)), JField(lemma,JString(Some)))), JObject(List(JField(word,JString(text)), JField(lemma,JString(text))))))
// JArray(List(JObject(List(JField(word,JString(and)), JField(lemma,JString(And)))), JObject(List(JField(word,JString(here)), JField(lemma,JString(here))))))
tokenList <- (jsonObj \\ "tokens").children
// subTokenList are:
// List(JObject(List(JField(word,JString(Some)), JField(lemma,JString(Some)))), JObject(List(JField(word,JString(text)), JField(lemma,JString(text)))))
// List(JObject(List(JField(word,JString(and)), JField(lemma,JString(And)))), JObject(List(JField(word,JString(here)), JField(lemma,JString(here)))))
JArray(subTokenList) <- tokenList
// liftToken are:
// JObject(List(JField(word,JString(Some)), JField(lemma,JString(Some))))
// JObject(List(JField(word,JString(text)), JField(lemma,JString(text))))
// JObject(List(JField(word,JString(and)), JField(lemma,JString(And))))
// JObject(List(JField(word,JString(here)), JField(lemma,JString(here))))
liftToken <- subTokenList
// token are:
// Token(Some,Some)
// Token(text,text)
// Token(and,And)
// Token(here,here)
token = liftToken.extract[Token]
} yield token
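The for-comprehension above pattern-matches each JArray to flatten one level before extracting. The same shape with plain nested lists (no lift-json required):

```scala
case class Token(word: String, lemma: String)

// Two "sentences", each a list of (word, lemma) pairs, mirroring the nested JArrays.
val raw: List[List[(String, String)]] = List(
  List(("Some", "Some"), ("text", "text")),
  List(("and", "And"), ("here", "here")))

val tokens = for {
  sentence      <- raw       // outer list: one entry per "tokens" array
  (word, lemma) <- sentence  // inner list: one entry per token
} yield Token(word, lemma)
// List(Token(Some,Some), Token(text,text), Token(and,And), Token(here,here))
```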