I have a schema for JSON data defined as:
val gpsSchema: StructType =
StructType(Array(
StructField("Name",StringType,true),
StructField("GPS", ArrayType(
StructType(Array(
StructField("TimeStamp",DoubleType,true),
StructField("Longitude", DoubleType, true),
StructField("Latitude",DoubleType,true)
)),true),true)))
data
{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}
How can I add a new StructField "ID" (uid) to the GPS array such that
before
[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
after
[{"ID": 123,"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"ID": 123, "TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"ID": 123,"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"ID": 123,"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
One way is to flatten the nested fields, add the new column "ID", rebuild each element with struct("ID","TimeStamp","Longitude","Latitude"), and then perform a collect_list, as below:
Dataframe
.withColumn( "ID", uuid())
.withColumn("GPS", explode($"GPS"))
.select($"ID", $"Name", $"GPS.*")
.select($"Name" ,struct("ID","TimeStamp","Longitude","Latitude").alias("field"))
.groupBy("Name").agg(collect_list($"field"))
This will be an expensive operation if there are a large number of elements in the array, which may cause the Spark driver to crash.
Is there another way to just add the "ID" field within the GPS array of the existing schema?
If you don't want to use explode, groupBy and collect_list, try the code below.
scala> df.printSchema
root
|-- GPS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Latitude: double (nullable = true)
| | |-- Longitude: double (nullable = true)
| | |-- TimeStamp: double (nullable = true)
|-- Name: string (nullable = true)
scala> :paste
// Entering paste mode (ctrl-D to finish)
val addCol = udf((id:String,json:String) => {
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
import org.json4s.JsonDSL._
compact(parse(json).extract[List[Map[String,String]]].map(m => m ++ Map("id" -> id)))
})
// Exiting paste mode, now interpreting.
addCol: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3755/1542889653@a4d1d2c,StringType,List(Some(class[value[0]: string]), Some(class[value[0]: string])),None,true,true)
scala> df.withColumn("GPS_New",addCol(uuid,to_json($"GPS"))).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|GPS |Name|GPS_New |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[40.787052, -76.463684, 1.605449171259277E9], [40.787038, -76.464046, 1.605449175743052E9], [40.787022, -76.464465, 1.605449180932659E9], [40.787054, -76.464977, 1.605449187288478E9]]|John|[{"Latitude":"40.787052","Longitude":"-76.463684","TimeStamp":"1.605449171259277E9","id":"123"},{"Latitude":"40.787038","Longitude":"-76.464046","TimeStamp":"1.605449175743052E9","id":"123"},{"Latitude":"40.787022","Longitude":"-76.464465","TimeStamp":"1.605449180932659E9","id":"123"},{"Latitude":"40.787054","Longitude":"-76.464977","TimeStamp":"1.605449187288478E9","id":"123"}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Could someone help me convert nested JSON into normalized JSON objects? I have a deeply nested JSON structure. I have seen examples that flatten the structure; however, a single flattened structure doesn't help in my case.
Sample dictionary:
emp = {
'name': "Paul",
'Age': '25',
'Location': "USA",
'Addresses':[
{
"longitude": "987",
"latitude": "765",
"postal_code": "90266" ,
"area" : [{"state": "CA"}]
},
{
"longitude": "123",
"latitude": "456",
"postal_code": "1234" ,
"area" : [{"state": "NY"}]
}
]
}
Expected normalized objects:
emp = {'name': "Paul", 'Age': '25','Location': "USA"}
emp_address = [
{ "longitude": "987", "latitude": "765","postal_code": "90266" },
{ "longitude": "123", "latitude": "456","postal_code": "1234" }
]
emp_address_area = [{"state": "CA"}, {"state": "NY"}]
I tried recursive functions to solve this problem, but no luck.
def normalized_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for k, v in x.items():
                if not isinstance(v, (dict, list)):
                    pass
                flatten(v, name + k + '#')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '#')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
print(normalized_json(emp))
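A minimal sketch of one way to get separate normalized objects instead of a single flat dictionary. The function name normalize and the table-naming rule (parent name plus the lowercased key) are assumptions made for illustration, so the child tables come out as emp_addresses and emp_addresses_area rather than the exact names in the question. No keys linking child rows back to their parent are added, matching the expected output above.

```python
from collections import defaultdict

def normalize(record, name):
    """Split a nested dict into flat rows grouped by table name.

    Scalar values stay in the row of the current table. Every nested dict
    or list of dicts is moved into a child table named <table>_<key>.
    """
    tables = defaultdict(list)

    def visit(node, table):
        row = {}
        for key, value in node.items():
            children = value if isinstance(value, list) else [value]
            is_nested = isinstance(value, (dict, list)) and all(
                isinstance(child, dict) for child in children
            )
            if is_nested:
                # Recurse into each nested object under a child table name.
                for child in children:
                    visit(child, f"{table}_{key.lower()}")
            else:
                # Scalars (and lists of scalars) remain on the current row.
                row[key] = value
        tables[table].append(row)

    visit(record, name)
    return dict(tables)

sample = {
    "name": "Paul",
    "Addresses": [
        {"postal_code": "90266", "area": [{"state": "CA"}]},
        {"postal_code": "1234", "area": [{"state": "NY"}]},
    ],
}
for table_name, rows in normalize(sample, "emp").items():
    print(table_name, rows)
```

Calling normalize(emp, "emp") on the dictionary from the question should give one row for emp, two rows for emp_addresses and two rows for emp_addresses_area.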
Visitors of an eCommerce site browse multiple products during their visit. All visit data of a visitor is consolidated in a JSON document containing the visitor Id and a list of product Ids, along with an interest attribute containing the level of interest expressed by the visitor in a product. Here are two example records, rec1 and rec2, containing visit data of two visitors, v1 and v2:
val rec1: String = """{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String = """{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
Given the collection of records (visitsData) and a map (productIdToNameMap) of product Ids and their names:
Write the code to enrich every record contained in visitsData with the name of the product. The output should be another sequence with all the original JSON documents enriched with product name. Here is the example output.
val output: Seq[String] = Seq(enrichedRec1, enrichedRec2)
where enrichedRec1 has value -
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
}"""
And enrichedRec2 has value -
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}"""
Here is one way to enrich the JSON:
package com.examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.{DataFrame, SparkSession}
object EnrichJson extends App {
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
Logger.getLogger("org").setLevel(Level.WARN)
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val rec1: String =
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String =
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
val dictionary = productIdToNameMap.toSeq.toDF("id", "name")
val rddData = spark.sparkContext.parallelize(visitsData)
dictionary.printSchema()
println("for spark version >2.2.0")
var resultDF = spark.read.json(visitsData.toDS)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
println("for spark version <2.2.0")
resultDF = spark.read.json(rddData)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
// .withColumn("products", explode(col("products")))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
/**
* convertJson : converts the data frame to json string
* @param resultDF
*/
private def convertJson(resultDF: DataFrame) = {
import org.apache.spark.sql.functions.{collect_list, _}
val x: DataFrame = resultDF
.groupBy("visitorId")
.agg(collect_list(struct("id", "interest", "name")).as("products"))
x.show
println(x.toJSON.collect.mkString)
}
}
Result :
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
for spark version >2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}
for spark version <2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}
Example of a method that parses JSON with Scala and returns the result as case classes:
/** ---------------------------------------
*
{
"fields": [
{
"field1": "value",
"field2": [
{
"field21": "value",
"field22": "value"
},
{
"field21": "value",
"field22": "value"
}
]
}
]
}*/
import scala.io.Source
import scala.util.parsing.json.JSON // legacy JSON parser
case class elementClass(element1 : String, element2 : String)
case class outputDataClass(field1 : String, exampleClassData : List[elementClass])
def multipleMapJsonParser(jsonDataFile : String) : List[outputDataClass] = {
val JsonData : String = Source.fromFile(jsonDataFile).getLines.mkString
val jsonFormatData = JSON.parseFull(JsonData)
.map{
case json : Map[String, List[Map[String,Any]]] => json("fields").map(
jsonElem =>
outputDataClass(jsonElem("field1").toString,
jsonElem("field2").asInstanceOf[List[Map[String,String]]].map{
case element : Map[String,String] => elementClass(element("field21"),element("field22"))
})
)
}.get
jsonFormatData
}
I have the following JSON objects:
{
"user_id": "123",
"data": {
"city": "New York"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
I'm using val df = sqlContext.read.json("json") to read the file into a DataFrame,
which combines all data attributes into a single data struct, like so:
root
|-- data: struct (nullable = true)
| |-- age: string (nullable = true)
| |-- city: string (nullable = true)
| |-- name: string (nullable = true)
| |-- occupation: string (nullable = true)
|-- session_id: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- user_id: string (nullable = true)
Is it possible to transform the data field to a Map[String, String] data type, so that each row only has the attributes present in its original JSON?
Yes, you can achieve that by extracting a Map[String, String] from the JSON data, as shown next:
import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.{to_json, from_json}
import spark.implicits._ // needed for .toDS and the $"..." column syntax
val jsonStr = """{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
val mappingSchema = MapType(StringType, StringType)
df.select(from_json(to_json($"data"), mappingSchema).as("map_data"))
//Output
// +-----------------------------------------------------+
// |map_data |
// +-----------------------------------------------------+
// |[age -> 23, name -> some_name, occupation -> teacher]|
// +-----------------------------------------------------+
First we serialize the content of the data field into a string with to_json($"data"), then we parse it back and extract the Map with from_json(to_json($"data"), mappingSchema).
I'm not sure what you mean by converting it to a Map of (String, String), but see if the code below helps.
val dataDF = spark.read.option("multiline","true").json("madhu/user.json").select("data").toDF
dataDF
.withColumn("age", $"data"("age")).withColumn("city", $"data"("city"))
.withColumn("name", $"data"("name"))
.withColumn("occupation", $"data"("occupation"))
.drop("data")
.show
I wrote a method to concatenate JSON values.
def mergeSales(storeJValue: JValue): String = {
val salesJValue: JValue = parse(rawJson)
val store = compact(render(storeJValue))
val sales = compact(render(salesJValue))
val mergedSales: String = s"""{"store":$store,"sales":$sales}"""
mergedSales
}
As a result I'm getting strings like this, a store with an array of corresponding sales:
{"store":{"store_id":"01","name":"Store_1"}, "sales":[{"saleId": 10, "name": "New name1", "saleType": "New Type1"}, {"saleId": 20, "name": "Some name1", "saleType": "SomeType5"}, {"saleId": 30, "name": "Some name3", "saleType": "SomeType3"}]}
How should I parse it to get a list of records where the same store is mapped to each sale from the array? I want it to look like this:
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 10, "name": "New name1", "saleType": "New Type1"}}
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 20, "name": "Some name1", "saleType": "SomeType5"}}
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 30, "name": "Some name3", "saleType": "SomeType3"}}
Sales have a huge amount of fields in reality, so creating a case class will be rather complex.
I think the best way is to use the json4s API, which will parse your JSON and convert it into objects that you can easily traverse.
You need to create these case classes:
case class Store(store_id: String, name: String)
case class Sale(saleId:String, name:String, saleType:String)
case class Result(store: Store, sale: Sale)
case class SaleStore(store: Store, sales: List[Sale])
Then it is straightforward to get the solution using json4s:
val str =
"""{
| "store": {
| "store_id": "01",
| "name": "Store_1"
| },
| "sales": [
| {
| "saleId": 10,
| "name": "New name1",
| "saleType": "New Type1"
| },
| {
| "saleId": 20,
| "name": "Some name1",
| "saleType": "SomeType5"
| },
| {
| "saleId": 30,
| "name": "Some name3",
| "saleType": "SomeType3"
| }
| ]
|}""".stripMargin
import org.json4s._
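import org.json4s.jackson.JsonMethods._
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule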
implicit val formats = org.json4s.DefaultFormats
val saleStore = parse(str).extract[SaleStore]
val result = saleStore.sales.flatMap(sale => List(saleStore.store -> sale))
val mapper: ObjectMapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
result.map(r => mapper.writeValueAsString(Result(r._1, r._2))).foreach(println)
output:
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"10","name":"New name1","saleType":"New Type1"}}
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"20","name":"Some name1","saleType":"SomeType5"}}
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"30","name":"Some name3","saleType":"SomeType3"}}
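Since the question mentions that a sale has a large number of fields, a variant that never declares those fields may also be worth considering. The following is only a sketch in plain Python with the standard json module, not json4s; the function name explode_sales is hypothetical. The same idea (treat each sale as an opaque JSON value and wrap it together with the store) should carry over to working on the parsed JSON tree in Scala instead of extracting case classes.

```python
import json

def explode_sales(merged_json):
    """Return one compact JSON string per sale, each paired with the store.

    Each sale is passed through untouched, so no field list is needed.
    """
    doc = json.loads(merged_json)
    store = doc["store"]
    return [
        json.dumps({"store": store, "sale": sale}, separators=(",", ":"))
        for sale in doc["sales"]
    ]

merged = (
    '{"store":{"store_id":"01","name":"Store_1"},'
    '"sales":[{"saleId":10,"name":"New name1","saleType":"New Type1"},'
    '{"saleId":20,"name":"Some name1","saleType":"SomeType5"}]}'
)
for line in explode_sales(merged):
    print(line)
```

Unlike the case-class version above, this keeps saleId as a number because the values are never converted to a declared type.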
Being a noob in Scala / Spark, I am a bit stuck and would appreciate any help!
I am importing JSON data into a Spark DataFrame. In the process, I end up with a DataFrame that has the same nested structure as the JSON input.
My aim is to flatten the entire DataFrame recursively (including the innermost child attributes in an array / dictionary), using Scala.
Additionally, there may be child attributes that have the same names, so I need to differentiate them as well.
A somewhat similar solution (same child attributes for different parents) is shown here - https://stackoverflow.com/a/38460312/3228300
An example of what I am hoping to achieve is as follows:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
The corresponding flattened output Spark DF structure would be:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters_batter_id_0": "1001",
"batters_batter_type_0": "Regular",
"batters_batter_id_1": "1002",
"batters_batter_type_1": "Chocolate",
"batters_batter_id_2": "1003",
"batters_batter_type_2": "Blueberry",
"batters_batter_id_3": "1004",
"batters_batter_type_3": "Devil's Food",
"topping_id_0": "5001",
"topping_type_0": "None",
"topping_id_1": "5002",
"topping_type_1": "Glazed",
"topping_id_2": "5005",
"topping_type_2": "Sugar",
"topping_id_3": "5007",
"topping_type_3": "Powdered Sugar",
"topping_id_4": "5006",
"topping_type_4": "Chocolate with Sprinkles",
"topping_id_5": "5003",
"topping_type_5": "Chocolate",
"topping_id_6": "5004",
"topping_type_6": "Maple"
}
Not having worked much with Scala and Spark previously, I am unsure how to proceed.
Lastly, I would be extremely thankful if someone could help with the code for a general / non-schema solution, as I need to apply it to a lot of different collections.
Thanks a lot :)
Here is one possible approach, which we used in one of our projects.
Define a case class that maps a row from the DataFrame:
case class BattersTopics(id: String, `type`: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)
Map each row from the DataFrame to the case class:
val dataSet = df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
batters_batter_id_0 = row.getAs[String]("batters_batter_id_0"), ...))
Collect to a list and make a Map[String, Any] from each row:
val rows = dataSet.collect().toList
rows.map(bt => Map(
"id" -> bt.id,
"type" -> bt.`type`,
"batters" -> Map(
"batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" -> bt.batters_batter_type_0), ...) // same for the other ids and types
),
"topping" -> List(Map("id" -> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the other ids and types
))
Use Jackson to convert the Map[String, Any] to JSON.
Sample data, which contains all the different types of JSON elements (nested JSON map, JSON array, long, string, etc.):
{"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- phones: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- models: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- name: string (nullable = true)
|-- watches: struct (nullable = true)
| |-- models: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- name: string (nullable = true)
This sample data has ArrayType and StructType (map) values in the JSON data.
We can write a switch condition for each of these two types and repeat the process until the data flattens out to the required DataFrame.
Here is the Spark Java API solution:
https://medium.com/@ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b
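The "handle structs, handle arrays, repeat until flat" idea can be sketched outside Spark on an already parsed JSON document. This is a plain Python illustration, not the Spark API from the linked article; the function name flatten and the naming rule (key path joined with "_", array indices appended at the end, as in batters_batter_id_0 from the question) are assumptions for this example.

```python
def flatten(node, keys=(), indices=()):
    """Flatten nested dicts and lists into a single-level dict.

    Column names join the key path with "_" and append any array
    indices at the end, for example batters_batter_id_0.
    """
    if isinstance(node, dict):
        # Struct case: descend into each field, extending the key path.
        out = {}
        for key, value in node.items():
            out.update(flatten(value, keys + (key,), indices))
        return out
    if isinstance(node, list):
        # Array case: descend into each element, remembering its index.
        out = {}
        for i, item in enumerate(node):
            out.update(flatten(item, keys, indices + (i,)))
        return out
    # Leaf case: emit one column for the scalar value.
    name = "_".join(keys + tuple(str(i) for i in indices))
    return {name: node}

donut = {
    "id": "0001",
    "batters": {"batter": [{"id": "1001", "type": "Regular"}]},
    "topping": [{"id": "5001", "type": "None"}, {"id": "5002", "type": "Glazed"}],
}
for column, value in flatten(donut).items():
    print(column, value)
```

Because the index is part of the column name, the set of columns depends on the longest array in the data, which is worth keeping in mind before applying the same scheme to a large DataFrame.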