pyspark - read json files

pyspark - read json files - json

I am trying to read this json files.
{
"data": [{
"id": "c1",
"type": "corporate",
"tenor": "10.3 years",
"yield": "5.30%",
"amount_outstanding": 1200000
},
{
"id": "g1",
"type": "government",
"tenor": "9.4 years",
"yield": "3.70%",
"amount_outstanding": 2500000
},
]}
Code
df = spark.read.option("multiline", True).json("sample_input.json")
df.select(col("data")).show()
However this reads everything into single column. Is there a way I can apply schema using id, type, tenor and other columns?

If you try to load multi-line json with permissive mode then you can able see dataframe correctly,
df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount_outstanding: long (nullable = true)
| | |-- id: string (nullable = true)
| | |-- tenor: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- yield: string (nullable = true)
If you remove the data element from json and try to load then it will load correctly like this.
Json content:
[{
"id": "c1",
"type": "corporate",
"tenor": "10.3 years",
"yield": "5.30%",
"amount_outstanding": 1200000
},
{
"id": "g1",
"type": "government",
"tenor": "9.4 years",
"yield": "3.70%",
"amount_outstanding": 2500000
}]
Ouput Schema:
df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()
df.printSchema()
root
|-- amount_outstanding: long (nullable = true)
|-- id: string (nullable = true)
|-- tenor: string (nullable = true)
|-- type: string (nullable = true)
|-- yield: string (nullable = true)
In both option you have correct data frame once it is correctly populated extract and transform data the way you want.

Related

Add new StructField to Array of StructFields in scala

I have a schema for json data defined as
val gpsSchema: StructType =
StructType(Array(
StructField("Name",StringType,true),
StructField("GPS", ArrayType(
StructType(Array(
StructField("TimeStamp",DoubleType,true),
StructField("Longitude", DoubleType, true),
StructField("Latitude",DoubleType,true)
)),true),true)))
data
{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}
How can I add a new StructField "ID" (uid) to the GPS array such that
before
[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
after
[{"ID": 123,"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"ID": 123, "TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"ID": 123,"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"ID": 123,"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
One way is to flatten the nested fields, add the new column "ID", use struct("ID","TimeStamp","Longitude","Latitude") and perform a collect_list as below:-
Dataframe
.withColumn( "ID", uuid())
.withColumn("GPS", explode($"GPS"))
.select($"ID", $"Name", $"GPS.*")
.select($"Name" ,struct("ID","TimeStamp","Longitude","Latitude").alias("field"))
.groupBy("Name").agg(collect_list($"field"))
This will be an expensive operation if the are a large number of elements in the array which may cause the spark driver to crash
Is there another way to just add in the "ID" field within the GPS array of the existing schema?

If you don't want to use explode, groupBy & collect_list, Try below code.
scala> df.printSchema
root
|-- GPS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Latitude: double (nullable = true)
| | |-- Longitude: double (nullable = true)
| | |-- TimeStamp: double (nullable = true)
|-- Name: string (nullable = true)
scala> :paste
// Entering paste mode (ctrl-D to finish)
val addCol = udf((id:String,json:String) => {
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
import org.json4s.JsonDSL._
compact(parse(json).extract[List[Map[String,String]]].map(m => m ++ Map("id" -> id)))
})
// Exiting paste mode, now interpreting.
addCol: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3755/1542889653#a4d1d2c,StringType,List(Some(class[value[0]: string]), Some(class[value[0]: string])),None,true,true)
scala> df.withColumn("GPS_New",addCol(uuid,to_json($"GPS"))).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|GPS |Name|GPS_New |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[40.787052, -76.463684, 1.605449171259277E9], [40.787038, -76.464046, 1.605449175743052E9], [40.787022, -76.464465, 1.605449180932659E9], [40.787054, -76.464977, 1.605449187288478E9]]|John|[{"Latitude":"40.787052","Longitude":"-76.463684","TimeStamp":"1.605449171259277E9","id":"123"},{"Latitude":"40.787038","Longitude":"-76.464046","TimeStamp":"1.605449175743052E9","id":"123"},{"Latitude":"40.787022","Longitude":"-76.464465","TimeStamp":"1.605449180932659E9","id":"123"},{"Latitude":"40.787054","Longitude":"-76.464977","TimeStamp":"1.605449187288478E9","id":"123"}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

How to Parse Nested Json using Spark and Scala?

Visitors of an eCommerce site browse multiple products during their visit. All visit data of a visitor is consolidated in a JSON document containing vistor Id and a list of product Ids, along with an interest attribute containing value of interest expressed by visitor in a product. Here are two example records - rec1 and rec2 containing visit data of two visitors v1 and v2:
val rec1: String = """{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String = """{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
Given the collection of records (visitsData) and a map (productIdToNameMap) of product Ids and their names:
Write the code to enrich every record contained in visitsData with the name of the product. The output should be another sequence with all the original JSON documents enriched with product name. Here is the example output.
val output: Seq[String] = Seq(enrichedRec1, enrichedRec1)
where enrichedRec1 has value -
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
}"""
And enrichedRec2 has value -
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}"""

This is the way to do enrichment of the json
package com.examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.{DataFrame, SparkSession}
object EnrichJson extends App {
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
Logger.getLogger("org").setLevel(Level.WARN)
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val rec1: String =
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String =
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
val dictionary = productIdToNameMap.toSeq.toDF("id", "name")
val rddData = spark.sparkContext.parallelize(visitsData)
dictionary.printSchema()
println("for spark version >2.2.0")
var resultDF = spark.read.json(visitsData.toDS)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
println("for spark version <2.2.0")
resultDF = spark.read.json(rddData)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
// .withColumn("products", explode(col("products")))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
/**
* convertJson : converts the data frame to json string
* #param resultDF
*/
private def convertJson(resultDF: DataFrame) = {
import org.apache.spark.sql.functions.{collect_list, _}
val x: DataFrame = resultDF
.groupBy("visitorId")
.agg(collect_list(struct("id", "interest", "name")).as("products"))
x.show
println(x.toJSON.collect.mkString)
}
}
Result :
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
for spark version >2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}
for spark version <2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}

Example of method to parse json with scala and give the result in case class
/** ---------------------------------------
*
{
"fields": [
{
"field1": "value",
"field2": [
{
"field21": "value",
"field22": "value"
},
{
"field21": "value",
"field22": "value"
}
]
}
]
}*/
case class elementClass(element1 : String, element2 : String)
case class outputDataClass(field1 : String, exampleClassData : List[elementClass])
def multipleMapJsonParser(jsonDataFile : String) : List[outputDataClass] = {
val JsonData : String = Source.fromFile(jsonDataFile).getLines.mkString
val jsonFormatData = JSON.parseFull(JsonData)
.map{
case json : Map[String, List[Map[String,Any]]] => json("fields").map(
jsonElem =>
outputDataClass(jsonElem("field1").toString,
jsonElem("field2").asInstanceOf[List[Map[String,String]]].map{
case element : Map[String,String] => elementClass(element("field21"),element("field22"))
})
)
}.get
jsonFormatData
}

How to convert nested JSON to map object in scala

I have the following JSON objects:
{
"user_id": "123",
"data": {
"city": "New York"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
I'm using val df = sqlContext.read.json("json") to read the file to dataframe
Which combines all data attributes into data struct like so:
root
|-- data: struct (nullable = true)
| |-- age: string (nullable = true)
| |-- city: string (nullable = true)
| |-- name: string (nullable = true)
| |-- occupation: string (nullable = true)
|-- session_id: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- user_id: string (nullable = true)
Is it possible to transform data field to MAP[String, String] Data type? And so it only has the same attributes as original json?

Yes you can achieve that by exporting a Map[String, String] from the JSON data as shown next:
import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.{to_json, from_json}
val jsonStr = """{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
val mappingSchema = MapType(StringType, StringType)
df.select(from_json(to_json($"data"), mappingSchema).as("map_data"))
//Output
// +-----------------------------------------------------+
// |map_data |
// +-----------------------------------------------------+
// |[age -> 23, name -> some_name, occupation -> teacher]|
// +-----------------------------------------------------+
First we extract the content of the data field into a string with to_json($"data"), then we parse and extract the Map with from_json(to_json($"data"), schema).

Not sure what you mean to convert it to a Map of (String, String), But see if below can help.
val dataDF = spark.read.option("multiline","true").json("madhu/user.json").select("data").toDF
dataDF
.withColumn("age", $"data"("age")).withColumn("city", $"data"("city"))
.withColumn("name", $"data"("name"))
.withColumn("occupation", $"data"("occupation"))
.drop("data")
.show

Parsing Nested JSON using SCALA

I am looking to inject Telemetry data and the output is a multi layered nested JSON file. I am interested in very specific fields but I am not able to parse the JSON file to get to the data.
Data Sample:
{ "version_str": "1.0.0", "node_id_str": "router-01", "encoding_path":
"sys/intf", "collection_id": 241466, "collection_start_time": 0,
"collection_end_time": 0, "msg_timestamp": 0, "subscription_id": [ ],
"sensor_group_id": [ ], "data_source": "DME", "data": {
"interfaceEntity": { "attributes": { "childAction": "", "descr": "",
"dn": "sys/intf", "modTs": "2017-09-19T13:24:14.751+00:00",
"monPolDn": "uni/fabric/monfab-default", "persistentOnReload": "true",
"status": "" }, "children": [ { "l3LbRtdIf": { "attributes": {
"adminSt": "up", "childAction": "", "descr": "Nothing", "id":
"lo103", "linkLog": "default", "modTs":
"2017-11-06T23:18:02.974+00:00", "monPolDn":
"uni/fabric/monfab-default", "name": "", "persistentOnReload": "true",
"rn": "lb-[lo103]", "status": "", "uid": "0" }, "children": [ {
"ethpmLbRtdIf": { "attributes": { "currErrIndex": "4294967295",
"ifIndex": "335544423", "iod": "14", "lastErrors": "0,0,0,0",
"operBitset": "", "operDescr": "Nothing", "operMtu": "1500",
"operSt": "up", "operStQual": "none", "rn": "lbrtdif" } } }, {
"nwRtVrfMbr": { "attributes": { "childAction": "", "l3vmCfgFailedBmp":
"", "l3vmCfgFailedTs": "00:00:00:00.000", "l3vmCfgState": "0",
"modTs": "2017-11-06T23:18:02.945+00:00", "monPolDn": "",
"parentSKey": "unspecified", "persistentOnReload": "true", "rn":
"rtvrfMbr", "status": "", "tCl": "l3Inst", "tDn": "sys/inst-default",
"tSKey": "" } } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt":
"up", "childAction": "", "descr": "Nothing", "id": "lo104",
"linkLog": "default", "modTs": "2018-01-25T15:54:20.367+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo104]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544424", "iod": "77",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2018-01-25T15:53:55.757+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "Nothing", "id": "lo101",
"linkLog": "default", "modTs": "2017-11-13T21:39:58.910+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo101]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544421", "iod": "12",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2017-11-13T21:39:58.880+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "\"^:tier2:if:loopback:mgmt:l3\"", "id":
"lo0", "linkLog": "default", "modTs": "2017-09-25T20:29:54.003+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo0]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544320", "iod": "11",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"\"^:tier2:if:loopback:mgmt:l3\"", "operMtu": "1500", "operSt": "up",
"operStQual": "none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr":...
I am interested in these attributes:
| | | | | | | |-- rmonIfIn: struct (nullable = true)
| | | | | | | | |-- attributes: struct (nullable = true )
| | | | | | | | | |-- broadcastPkts: string (nullabl e = true)
| | | | | | | | | |-- discards: string (nullable = t rue)
| | | | | | | | | |-- errors: string (nullable = tru e)
| | | | | | | | | |-- multicastPkts: string (nullabl e = true)
| | | | | | | | | |-- nUcastPkts: string (nullable = true)
| | | | | | | | | |-- packetRate: string (nullable = true)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.explode
import spark.implicits._
val spark = SparkSession.builder().getOrCreate
val df = spark.read.option("header","true").option("inferSchema","true").json("file:///usr/local/Projects/out.txt")
val mapDF = df.select($"node_id_str" as "nodename", $"data".getItem("InterfaceEntity").getItem("children").getItem("l1PhysIf").getItem("children").getItem("element"))
I keep getting an error when I attempt to get any deeper, I keep getting data type error:
stringJsonDF: org.apache.spark.sql.DataFrame = [nestDevice: string]
org.apache.spark.sql.AnalysisException: cannot resolve '`data`.`InterfaceEntity`.`children`.`l1PhysIf`.`children`['element']' due to data type mismatch: argument 2 requires integral type, however, ''element'' is of string type.;;

You can use Google Gson Library which is used to work with json. You can convert any object to json and of course do it in reverse. here is an example for doing so:
Gson gson = new Gson();
List<Map<Long, String>> listOfMaps = new ArrayList<>();
//here you can new some maps and add them to the listOfMaps.
String listOfMapsInJsonFormat = gson.toJson(listOfMaps);
above sample code is for converting an object to json. To do the reverse job you can check below one too:
Gson gson = new Gson();
List list = gson.fromJson(listOfMapsInJsonFormat, List.class);
the above code will change your input json string to a list which contains maps. Of course there may be a difference in the type of the map you have had before converting the original object to json and the one gson builds the object from json string. to avoid that you can use TypeToken class:
Gson gson = new Gson();
Type type = new TypeToken()<ArrayList<Map<>>>{}.getType();
ArrayList<Map<>> = gson.fromJson(listOfMapsInJsonFormat, type);

Since the fields are part of multiple nested arrays the logic would assume that you are interested in all iterations of those fields per record (so if one record contains n rmonIfIn items due to nested arrays, you would be interested in each of them?)
If so it makes sense to explode these nested arrays and process the expanded dataframe.
Based on your code and incomplete json example it could look like something like this:
val nested = df
.select(explode($"data.InterfaceEntity").alias("l1"))
.select(explode($"l1.l1PhysIf").alias("l2"))
.select($"l2.rmonIfIn.attributes".alias("l3"))
.select($"l3.broadcastPkts", $"l3.discards", $"l3.errors", $"l3.multicastPkts", $"l3.packetRate")
Returning a dataframe that could look like
+-------------+--------+------+-------------+----------+
|broadcastPkts|discards|errors|multicastPkts|packetRate|
+-------------+--------+------+-------------+----------+
|1 |1 |1 |1 |1 |
|2 |2 |2 |2 |2 |
|3 |3 |3 |3 |3 |
|4 |4 |4 |4 |4 |
+-------------+--------+------+-------------+----------+

Flatten Spark JSON Data Frame with nested children attributes having the same names

Being a noob in Scala / Spark, am a bit stuck and would appreciate any help!
Am importing JSON data into a Spark Data Frame. In the process, I end up getting a Data frame having the same nested structure present in the JSON input.
My aim is to flatten the entire Data Frame recursively (including the inner most child attributes in an array / dictionary), using Scala.
Additionally, there may be children attributes which have the same names. Hence, need to differentiate them as well.
A somewhat similar solution (same child attributes for different parents) is shown here - https://stackoverflow.com/a/38460312/3228300
An example of what I am hoping to achieve is as follows:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
The corresponding flattened output Spark DF structure would be:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters_batter_id_0": "1001",
"batters_batter_type_0": "Regular",
"batters_batter_id_1": "1002",
"batters_batter_type_1": "Chocolate",
"batters_batter_id_2": "1003",
"batters_batter_type_2": "Blueberry",
"batters_batter_id_3": "1004",
"batters_batter_type_3": "Devil's Food",
"topping_id_0": "5001",
"topping_type_0": "None",
"topping_id_1": "5002",
"topping_type_1": "Glazed",
"topping_id_2": "5005",
"topping_type_2": "Sugar",
"topping_id_3": "5007",
"topping_type_3": "Powdered Sugar",
"topping_id_4": "5006",
"topping_type_4": "Chocolate with Sprinkles",
"topping_id_5": "5003",
"topping_type_5": "Chocolate",
"topping_id_6": "5004",
"topping_type_6": "Maple"
}
Not having worked much with Scala and Spark previously, am unsure how to proceed.
Lastly, would be extremely thankful if someone can please help with the code for a general / non-schema solution as I need to be applying it to a lot of different collections.
Thanks a lot :)

Here is one possibility we approach it in one of our project
List item
define a case class that maps a row from the dataframe
case class BattersTopics(id: String, type: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)
List item
map each row from the dataframe to case class
df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
batters_batter_id_0 = row.getAs[String]("batters_batter_id_0 "), ...)
Collect to a list and make a Map[String, Any] from the dataframe
val rows = dataSet.collect().toList
rows.map(bt => Map (
"id" -> bt.id,
"type" -> bt.type,
"batters" -> Map(
"batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" ->
bt.batters_batter_type_0), ....) // same for the others id and types
"topping" -> List(Map("id"-> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the others id and type
)
))
Use Jackson to convert the Map[String, Any] to Json

Sample Data : which contains All different types of JSON element (Nested JSON Map, JSON Array, long, String etc..)
{"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}
root
|— age: long (nullable = true)
| — name: string (nullable = true)
| — phones: array (nullable = true)
| | — element: struct (containsNull = true)
| | | — models: array (nullable = true)
| | | | — element: string (containsNull = true)
| | | — name: string (nullable = true)
| — watches: struct (nullable = true)
| | — models: array (nullable = true)
| | | — element: string (containsNull = true)
| | — name: string (nullable = true)
this is the sample data which have arraytype and structtype (Map) values in json Data.
We can use write first two switch conditions for each type and repeat this process unlit it flattens out to the required Dataframe.
https://medium.com/#ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b
Here, is the Spark Java API solution.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

pyspark - read json files - json

Related

Add new StructField to Array of StructFields in scala

How to Parse Nested Json using Spark and Scala?

How to convert nested JSON to map object in scala

Parsing Nested JSON using SCALA

Flatten Spark JSON Data Frame with nested children attributes having the same names

Categories

Resources