Spark: Splitting JSON strings into separate dataframe columns

I'm loading the JSON string below into a DataFrame column.
{
"title": {
"titleid": "222",
"titlename": "ABCD"
},
"customer": {
"customerDetail": {
"customerid": 878378743,
"customerstatus": "ACTIVE",
"customersystems": {
"customersystem1": "SYS01",
"customersystem2": null
},
"sysid": null
},
"persons": [{
"personid": "123",
"personname": "IIISKDJKJSD"
},
{
"personid": "456",
"personname": "IUDFIDIKJK"
}]
}
}
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
//Schema:
StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))
//Output:
+------------------------------+---------------------------------------+
|customerDetail |persons |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
My question: is there a way to split the key-value pairs into separate DataFrame columns like below,
while keeping the array columns as they are, since I need exactly one record per JSON string?
Example for the customer column:
customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }

Edited post :
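import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

// Recursively collect the dotted path of every non-struct field.
// Arrays are not StructType, so they are kept as single columns.
def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

// Every collected path starts with a leading ".", which tail drops.
val fields = collectFields("", df.schema).map(_.tail)
df.select(fields.map(col): _*).show(false)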
Output :
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE |SYS01 |null |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222 |ABCD |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
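The recursion in collectFields can be tried without a Spark session. The sketch below is only an illustration: Struct, Field and Leaf are hypothetical stand-ins for Spark's StructType, StructField and every non-struct type, and the toy schema is a shortened version of the one in the question.

```scala
// Hypothetical stand-ins for Spark's schema types, just enough to show the recursion.
sealed trait DataType
case object Leaf extends DataType // any non-struct type (string, long, array, ...)
final case class Field(name: String, dataType: DataType)
final case class Struct(fields: Seq[Field]) extends DataType

// Depth-first walk: only structs are descended into, so arrays stay whole.
def collectFields(prefix: String, dt: DataType): Seq[String] = dt match {
  case Struct(fields) =>
    fields.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      collectFields(path, f.dataType)
    }
  case _ => Seq(prefix)
}

// Shortened version of the schema from the question (assumed shape).
val schema = Struct(Seq(
  Field("customer", Struct(Seq(
    Field("customerDetail", Struct(Seq(
      Field("customerid", Leaf),
      Field("customersystems", Struct(Seq(Field("customersystem1", Leaf))))
    ))),
    Field("persons", Leaf) // the array column is treated as a leaf
  ))),
  Field("title", Struct(Seq(Field("titleid", Leaf))))
))

val paths = collectFields("", schema)
paths.foreach(println)
```

Building the prefix this way avoids the leading dot, so no tail call is needed afterwards. In Spark itself, the same paths could presumably also be used as aliases, for example col(path).as(path), if the full dotted names from the question are wanted as column headers.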

You can try doing this with RDDs: define the column names in an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate over it to fill the empty RDD.

Related

Groovy: How to parse the json specific key's value into list/array

I am new to Groovy.
From the output of prettyPrint(toJson()), I am trying to get a list of the values of a specific key inside a JSON array. Using the JSON output from prettyPrint shown below, I want to create a list that contains only the values of the name key.
My Code:
def string1 = jiraGetIssueTransitions(idOrKey: jira_id)
echo prettyPrint(toJson(string1.data))
def pretty = prettyPrint(toJson(string1.data))
def valid_strings = readJSON text: "${pretty}"
echo "valid_strings.name : ${valid_strings.name}"
The output of prettyPrint(toJson(string1.data)) is the JSON below:
{
"expand": "places",
"places": [
{
"id": 1,
"name": "Bulbasaur",
"type": {
"grass",
"poison"
}
},
{
"id": 2,
"name": "Ivysaur",
"type": {
"grass",
"poison"
}
}
}
Expected result
valid_strings.name : ["Bulbasaur", "Ivysaur"]
Current output
valid_strings.name : null
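The pretty-printed JSON content is invalid: each "type" value wraps bare strings in braces instead of brackets, and the "places" array is never closed.
If the JSON is valid, then the names can be accessed as follows: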
import groovy.json.JsonSlurper
def text = """
{
"expand": "places",
"places": [{
"id": 1,
"name": "Bulbasaur",
"type": [
"grass",
"poison"
]
},
{
"id": 2,
"name": "Ivysaur",
"type": [
"grass",
"poison"
]
}
]
}
"""
def json = new JsonSlurper().parseText(text)
println(json.places*.name)
Basically, use the spread operator to apply the attribute lookup (i.e., *.name) to every element of the appropriate object (i.e., json.places).
I've used something similar to print out elements within the response in ReadyAPI:
import groovy.json.*
import groovy.util.*
def json = '''[
  { "message" : "Success",
    "bookings" : [
      { "bookingId" : 60002172,
        "bookingDate" : "1900-01-01T00:00:00" },
      { "bookingId" : 59935582,
        "bookingDate" : "1900-01-01" },
      { "bookingId" : 53184048,
        "bookingDate" : "2019-01-15",
        "testId" : "12803798123",
        "overallScore" : "PASS" },
      { "bookingId" : 53183765,
        "bookingDate" : "2019-01-15T13:45:00" },
      { "bookingId" : 52783312,
        "bookingDate" : "1900-01-01" }
    ]
  }
]'''
def response = context.expand( json )
def parsedjson = new groovy.json.JsonSlurper().parseText(response)
log.info parsedjson
log.info " Count of records returned: " + parsedjson.size()
log.info " List of bookingIDs in this response: " + parsedjson.bookings*.bookingId

How to insert an empty object into JSON using Circe?

I'm getting a JSON object over the network, as a String. I'm then using Circe to parse it. I want to add a handful of fields to it, and then pass it on downstream.
Almost all of that works.
The problem is that my "adding" is really "overwriting". That's actually ok, as long as I add an empty object first. How can I add such an empty object?
So looking at the code below, I am overwriting "sometimes_empty:{}" and it works. But because sometimes_empty is not always empty, it results in some data loss. I'd like to add a field like "custom:{}" and then overwrite the value of custom with my existing code.
Two StackOverflow posts were helpful. One worked, but wasn't quite what I was looking for. The other I couldn't get to work.
1: Modifying a JSON array in Scala with circe
2: Adding field to a json using Circe
val js: String = """
{
"id": "19",
"type": "Party",
"field": {
"id": 1482,
"name": "Anne Party",
"url": "https"
},
"sometimes_empty": {
},
"bool": true,
"timestamp": "2018-12-18T11:39:18Z"
}
"""
val newJson = parse(js).toOption
.flatMap { doc =>
doc.hcursor
.downField("sometimes_empty")
.withFocus(_ =>
Json.fromFields(
Seq(
("myUrl", Json.fromString(myUrl)),
("valueZ", Json.fromString(valueZ)),
("valueQ", Json.fromString(valueQ)),
("balloons", Json.fromString(balloons))
)
)
)
.top
}
newJson match {
case Some(v) => return v.toString
case None => println("Failure!")
}
We need to do a couple of things. First, we zoom in on the specific property we want to update; if it doesn't exist, we create a new empty one. Then we turn the zoomed-in property from a Json into a JsonObject so that we can modify it using the +: method. Once we've done that, we take the updated property and re-introduce it into the original parsed JSON to get the complete result:
import io.circe.{Json, JsonObject, parser}
import io.circe.syntax._
object JsonTest {
def main(args: Array[String]): Unit = {
val js: String =
"""
|{
| "id": "19",
| "type": "Party",
| "field": {
| "id": 1482,
| "name": "Anne Party",
| "url": "https"
| },
| "bool": true,
| "timestamp": "2018-12-18T11:39:18Z"
|}
""".stripMargin
val maybeAppendedJson =
for {
json <- parser.parse(js).toOption
sometimesEmpty <- json.hcursor
.downField("sometimes_empty")
.focus
.orElse(Option(Json.fromJsonObject(JsonObject.empty)))
jsonObject <- json.asObject
emptyFieldJson <- sometimesEmpty.asObject
appendedField = emptyFieldJson.+:("added", Json.fromBoolean(true))
res = jsonObject.+:("sometimes_empty", appendedField.asJson)
} yield res
maybeAppendedJson.foreach(obj => println(obj.asJson.spaces2))
}
}
Yields:
{
"id" : "19",
"type" : "Party",
"field" : {
"id" : 1482,
"name" : "Anne Party",
"url" : "https"
},
"sometimes_empty" : {
"added" : true,
"someProperty" : true
},
"bool" : true,
"timestamp" : "2018-12-18T11:39:18Z"
}
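If the goal is only to attach a new object under a separate key such as custom, without touching sometimes_empty, a shorter route may be to add the field on the top-level object. This is a minimal sketch, assuming circe-core and circe-parser are on the classpath; the input document and the payload values are placeholders, not taken from the original post.

```scala
import io.circe.Json
import io.circe.parser.parse

// Shortened input (assumed); sometimes_empty carries data that must survive.
val js: String =
  """
    |{
    |  "id": "19",
    |  "sometimes_empty": { "keep": 1 }
    |}
  """.stripMargin

// Placeholder payload; the field names follow the question.
val custom: Json = Json.obj(
  "myUrl"  -> Json.fromString("https://example.invalid"),
  "valueZ" -> Json.fromString("z")
)

// mapObject only changes the value when the parsed document is a JSON object;
// add inserts the key, or replaces its value if the key already exists.
val updated: Option[Json] =
  parse(js).toOption.map(_.mapObject(_.add("custom", custom)))

updated.foreach(json => println(json.spaces2))
```

Because the new data lives under its own key, the existing content of sometimes_empty is left as it was.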

How to parse just part of JSON with Klaxon?

I'm trying to parse some JSON into Kotlin objects. The JSON looks like:
{
data: [
{ "name": "aaa", "age": 11 },
{ "name": "bbb", "age": 22 },
],
otherdata : "don't need"
}
I just need the data part of the entire JSON, and want to parse each item into a User object:
data class User(name:String, age:Int)
But I can't find an easy way to do it.
Here's one way you can achieve this
import com.beust.klaxon.Klaxon
import java.io.StringReader
val json = """
{
"data": [
{ "name": "aaa", "age": 11 },
{ "name": "bbb", "age": 22 }
],
"otherdata" : "not needed"
}
""".trimIndent()
data class User(val name: String, val age: Int)
fun main(args: Array<String>) {
val klaxon = Klaxon()
val parsed = klaxon.parseJsonObject(StringReader(json))
val dataArray = parsed.array<Any>("data")
val users = dataArray?.let { klaxon.parseFromJsonArray<User>(it) }
println(users)
}
This will work as long as you can fit the whole json string in memory. Otherwise you may want to look into the streaming API: https://github.com/cbeust/klaxon#streaming-api

Put data in multiple branches of an array: JSON transformer, Scala Play

I want to add values to every element of an array inside a JSON object.
For example:
value array: [4, 2.5, 2.5, 1.5]
json =
{
"items": [
{
"id": 1,
"name": "one",
"price": {}
},
{
"id": 2,
"name": "two"
},
{
"id": 3,
"name": "three",
"price": {}
},
{
"id": 4,
"name": "four",
"price": {
"value": 1.5
}
}
]
}
I want to transform the above JSON into:
{
"items": [
{
"id": 1,
"name": "one",
"price": {
"value": 4
}
},
{
"id": 2,
"name": "two",
"price": {
"value": 2.5
}
},
{
"id": 3,
"name": "three",
"price": {
"value": 2.5
}
},
{
"id": 4,
"name": "four",
"price": {
"value": 1.5
}
}
]
}
Any suggestions on how I can achieve this? My goal is to put values into specific fields of the elements of a JSON array. I am using the Play JSON library throughout my application. What other options do I have besides JSON transformers?
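You may use a simple transformation like this:
import play.api.libs.json._
import play.api.libs.json.Reads.of

// One {"price": {"value": ...}} object per item, in item order.
val prices = List[Double](4,2.5,2.5,1.5).map(price => Json.obj("price" -> Json.obj("value" -> price)))
// Replace "items" with each item merged (shallowly) with its price object.
val t = (__ \ "items").json.update(
  of[List[JsObject]]
    .map(_.zip(prices).map { case (item, price) => item ++ price })
    .map(items => JsArray(items.toIndexedSeq))
)
// Applying it with json.transform(t), where json is the parsed input, gives: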
res5: play.api.libs.json.JsResult[play.api.libs.json.JsObject] = JsSuccess({"items":[{"id":1,"name":"one","price":{"value":4}},{"id":2,"name":"two","price":{"value":2.5}},{"id":3,"name":"three","price":{"value":2.5}},{"id":4,"name":"four","price":{"value":1.5}}]},/items)
I suggest using classes, though I'm not sure this fits your project, since it's hard to guess what your whole codebase looks like.
I create the Item instances manually for simplicity. You can create the items using the JSON library instead :)
class Price(val value:Double) {
override def toString = s"{value:${value}}"
}
class Item(val id: Int, val name: String, val price: Price) {
def this(id: Int, name: String) {
this(id, name, null)
}
override def toString = s"{id:${id}, name:${name}, price:${price}}"
}
val price = Array(4, 2.5, 2.5, 1.5)
/** You might convert Json data to List[Item] using Json library instead. */
val items: List[Item] = List(
new Item(1, "one"),
new Item(2, "two"),
new Item(3, "three"),
new Item(4, "four", new Price(1.5))
)
val valueMappedItems = items.zipWithIndex.map{case (item, index) =>
if (item.price == null) {
new Item(item.id, item.name, new Price(price(index)))
} else {
item
}
}
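The same idea can be written without null by modelling the missing price as an Option. This is a minimal sketch in plain Scala; the case classes are hypothetical and only mirror the field names of the JSON in the question.

```scala
// Hypothetical model types; a missing or empty price is represented as None.
final case class Price(value: Double)
final case class Item(id: Int, name: String, price: Option[Price] = None)

val prices = List(4.0, 2.5, 2.5, 1.5)

val items = List(
  Item(1, "one"),
  Item(2, "two"),
  Item(3, "three"),
  Item(4, "four", Some(Price(1.5)))
)

// Pair each item with the price at the same position and fill in
// the price only where the item does not already carry one.
val filled: List[Item] = items.zip(prices).map { case (item, p) =>
  item.copy(price = item.price.orElse(Some(Price(p))))
}

filled.foreach(println)
```

Note that zip stops at the shorter of the two lists, so items without a matching entry in prices would be dropped; whether that is acceptable depends on the data.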

How to store the json response in an array and sort it

I want to store the RelationshipId and RelationshipType values in an array, sort it, and then print that array in Groovy.
I have this so far...
def slurper = new JsonSlurper()
def result = slurper.parseText(reponse)
{"RecipientRelationships": [
{
"RelationshipId": "15",
"RelationshipType": "Self"
},
{
"RelationshipId": "1",
"RelationshipType": "Mother"
},
{
"RelationshipId": "2",
"RelationshipType": "Father"
},
Like this?
import groovy.json.JsonSlurper
def response = '''{"RecipientRelationships": [
{
"RelationshipId": "15",
"RelationshipType": "Self"
},
{
"RelationshipId": "1",
"RelationshipType": "Mother"
},
{
"RelationshipId": "2",
"RelationshipType": "Father"
}]
}'''
JsonSlurper slurper = new JsonSlurper()
Map result = slurper.parseText(response)
result.RecipientRelationships.sort {
it.RelationshipId as Integer
}.each {
println it
}
I need to use the results to compare them to the table values:
id relationships
1 father
2 mothera
31 sisterinlaw
23 son
24 daughter
I have already printed the results of the query like this,
inside a function()
{
query = "select id, relationships from table"
result = db.rows(query)
return result
}
Now I need to compare the results of the table with the response.