I have the following JSON objects:
{
"user_id": "123",
"data": {
"city": "New York"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}
I'm using val df = sqlContext.read.json("json") to read the file into a DataFrame,
which merges all the data attributes into a single data struct like so:
root
|-- data: struct (nullable = true)
| |-- age: string (nullable = true)
| |-- city: string (nullable = true)
| |-- name: string (nullable = true)
| |-- occupation: string (nullable = true)
|-- session_id: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- user_id: string (nullable = true)
Is it possible to transform the data field into a Map[String, String] type, so that each row only has the attributes present in its original JSON object?
Yes, you can achieve that by extracting a Map[String, String] from the JSON data, as shown next:
import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.{to_json, from_json}
val jsonStr = """{
"user_id": "123",
"data": {
"name": "some_name",
"age": "23",
"occupation": "teacher"
},
"timestamp": "1563188698.31",
"session_id": "6a793439-6535-4162-b333-647a6761636b"
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
val mappingSchema = MapType(StringType, StringType)
df.select(from_json(to_json($"data"), mappingSchema).as("map_data"))
//Output
// +-----------------------------------------------------+
// |map_data |
// +-----------------------------------------------------+
// |[age -> 23, name -> some_name, occupation -> teacher]|
// +-----------------------------------------------------+
First we serialize the content of the data field to a JSON string with to_json($"data"), then we parse that string into a Map with from_json(to_json($"data"), mappingSchema).
I'm not sure what you mean by converting it to a Map of (String, String), but see if the code below helps.
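To see why this round trip leaves each row with only its own attributes, here is a minimal plain-Python sketch of the same idea, outside Spark. It assumes that the to_json step skips null struct fields (which, as far as I know, is Spark's default behaviour); struct_row and struct_to_map are hypothetical names used only for illustration.

```python
import json

# One row of the inferred struct: fields that were absent from the
# source record show up as None (null) after schema merging.
struct_row = {"age": None, "city": "New York", "name": None, "occupation": None}

def struct_to_map(struct):
    """Mimic from_json(to_json(struct), MapType(StringType, StringType))."""
    # "to_json" step: serialise the struct, skipping null fields (assumed default).
    text = json.dumps({k: v for k, v in struct.items() if v is not None})
    # "from_json" step: parse the text back into a string-to-string map.
    return {k: str(v) for k, v in json.loads(text).items()}

print(struct_to_map(struct_row))  # {'city': 'New York'}
```

The map for the first record therefore carries only city, while the second record would carry name, age and occupation.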
val dataDF = spark.read.option("multiline","true").json("madhu/user.json").select("data").toDF
dataDF
.withColumn("age", $"data"("age")).withColumn("city", $"data"("city"))
.withColumn("name", $"data"("name"))
.withColumn("occupation", $"data"("occupation"))
.drop("data")
.show
I have a schema for json data defined as
val gpsSchema: StructType =
StructType(Array(
StructField("Name",StringType,true),
StructField("GPS", ArrayType(
StructType(Array(
StructField("TimeStamp",DoubleType,true),
StructField("Longitude", DoubleType, true),
StructField("Latitude",DoubleType,true)
)),true),true)))
data
{"Name":"John","GPS":[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]}
How can I add a new StructField "ID" (uid) to the GPS array such that
before
[{"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
after
[{"ID": 123,"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
{"ID": 123, "TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038},
{"ID": 123,"TimeStamp": 1605449180.932659, "Longitude": -76.464465, "Latitude": 40.787022},
{"ID": 123,"TimeStamp": 1605449187.288478, "Longitude": -76.464977, "Latitude": 40.787054}]
One way is to flatten the nested fields, add the new column "ID", rebuild each element with struct("ID","TimeStamp","Longitude","Latitude"), and perform a collect_list as below:
Dataframe
.withColumn( "ID", uuid())
.withColumn("GPS", explode($"GPS"))
.select($"ID", $"Name", $"GPS.*")
.select($"Name" ,struct("ID","TimeStamp","Longitude","Latitude").alias("field"))
.groupBy("Name").agg(collect_list($"field"))
This will be an expensive operation if there are a large number of elements in the array, which may cause the Spark driver to crash.
Is there another way to just add in the "ID" field within the GPS array of the existing schema?
If you don't want to use explode, groupBy and collect_list, try the code below.
scala> df.printSchema
root
|-- GPS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Latitude: double (nullable = true)
| | |-- Longitude: double (nullable = true)
| | |-- TimeStamp: double (nullable = true)
|-- Name: string (nullable = true)
scala> :paste
// Entering paste mode (ctrl-D to finish)
val addCol = udf((id:String,json:String) => {
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
import org.json4s.JsonDSL._
compact(parse(json).extract[List[Map[String,String]]].map(m => m ++ Map("id" -> id)))
})
// Exiting paste mode, now interpreting.
addCol: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3755/1542889653@a4d1d2c,StringType,List(Some(class[value[0]: string]), Some(class[value[0]: string])),None,true,true)
scala> df.withColumn("GPS_New",addCol(uuid,to_json($"GPS"))).show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|GPS |Name|GPS_New |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[40.787052, -76.463684, 1.605449171259277E9], [40.787038, -76.464046, 1.605449175743052E9], [40.787022, -76.464465, 1.605449180932659E9], [40.787054, -76.464977, 1.605449187288478E9]]|John|[{"Latitude":"40.787052","Longitude":"-76.463684","TimeStamp":"1.605449171259277E9","id":"123"},{"Latitude":"40.787038","Longitude":"-76.464046","TimeStamp":"1.605449175743052E9","id":"123"},{"Latitude":"40.787022","Longitude":"-76.464465","TimeStamp":"1.605449180932659E9","id":"123"},{"Latitude":"40.787054","Longitude":"-76.464977","TimeStamp":"1.605449187288478E9","id":"123"}]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
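For comparison, the per-element transformation itself is just a map over the array. The following plain-Python sketch (outside Spark, standard library only) shows the intended before/after shape; add_id is a hypothetical helper, and the UUID is only an assumed way of generating the ID.

```python
import json
import uuid

def add_id(gps, new_id):
    """Return a new list in which every GPS point carries the same ID as its first key."""
    return [{"ID": new_id, **point} for point in gps]

record = json.loads("""{"Name":"John","GPS":[
 {"TimeStamp": 1605449171.259277, "Longitude": -76.463684, "Latitude": 40.787052},
 {"TimeStamp": 1605449175.743052, "Longitude": -76.464046, "Latitude": 40.787038}]}""")

# One ID per record, shared by all elements of its GPS array.
record["GPS"] = add_id(record["GPS"], str(uuid.uuid4()))
print(json.dumps(record["GPS"][0]))
```

Unlike the UDF output above, where every value comes back as a string, this sketch keeps TimeStamp, Longitude and Latitude as numbers.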
I am trying to read this JSON file.
{
"data": [{
"id": "c1",
"type": "corporate",
"tenor": "10.3 years",
"yield": "5.30%",
"amount_outstanding": 1200000
},
{
"id": "g1",
"type": "government",
"tenor": "9.4 years",
"yield": "3.70%",
"amount_outstanding": 2500000
},
]}
Code
df = spark.read.option("multiline", True).json("sample_input.json")
df.select(col("data")).show()
However, this reads everything into a single column. Is there a way I can apply a schema using id, type, tenor and the other columns?
If you load the multi-line JSON in permissive mode, you can see that the DataFrame is read correctly:
df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount_outstanding: long (nullable = true)
| | |-- id: string (nullable = true)
| | |-- tenor: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- yield: string (nullable = true)
If you remove the data element from the JSON and load it again, each field becomes a top-level column, like this:
Json content:
[{
"id": "c1",
"type": "corporate",
"tenor": "10.3 years",
"yield": "5.30%",
"amount_outstanding": 1200000
},
{
"id": "g1",
"type": "government",
"tenor": "9.4 years",
"yield": "3.70%",
"amount_outstanding": 2500000
}]
Output schema:
df = spark.read.option("multiline", "true").option("mode", "PERMISSIVE").json("sample_input.json")
df.printSchema()
root
|-- amount_outstanding: long (nullable = true)
|-- id: string (nullable = true)
|-- tenor: string (nullable = true)
|-- type: string (nullable = true)
|-- yield: string (nullable = true)
With either option you get a correct DataFrame; once it is populated, you can extract and transform the data the way you want.
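As a rough illustration of the "extract" step, here is a plain-Python sketch (standard library only, outside Spark) that turns each element of the data array into one row with the requested columns. Note that the sample in the question has a trailing comma before the closing bracket, which a strict parser such as json.loads rejects, so it is removed here; the columns list is simply the field order named in the question.

```python
import json

raw = """{"data": [
  {"id": "c1", "type": "corporate", "tenor": "10.3 years",
   "yield": "5.30%", "amount_outstanding": 1200000},
  {"id": "g1", "type": "government", "tenor": "9.4 years",
   "yield": "3.70%", "amount_outstanding": 2500000}
]}"""

columns = ["id", "type", "tenor", "yield", "amount_outstanding"]

# One tuple per element of the "data" array, in the column order above.
rows = [tuple(item.get(c) for c in columns) for item in json.loads(raw)["data"]]

for row in rows:
    print(row)
```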
Visitors of an eCommerce site browse multiple products during their visit. All visit data of a visitor is consolidated in a JSON document containing the visitor Id and a list of product Ids, along with an interest attribute containing the level of interest expressed by the visitor in a product. Here are two example records, rec1 and rec2, containing the visit data of two visitors v1 and v2:
val rec1: String = """{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String = """{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
Given the collection of records (visitsData) and a map (productIdToNameMap) of product Ids and their names:
Write the code to enrich every record contained in visitsData with the name of the product. The output should be another sequence with all the original JSON documents enriched with product name. Here is the example output.
val output: Seq[String] = Seq(enrichedRec1, enrichedRec2)
where enrichedRec1 has value -
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
}"""
And enrichedRec2 has value -
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}"""
This is one way to enrich the JSON:
package com.examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.{DataFrame, SparkSession}
object EnrichJson extends App {
private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
Logger.getLogger("org").setLevel(Level.WARN)
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val rec1: String =
"""{
"visitorId": "v1",
"products": [{
"id": "i1",
"interest": 0.68
}, {
"id": "i2",
"interest": 0.42
}]
}"""
val rec2: String =
"""{
"visitorId": "v2",
"products": [{
"id": "i1",
"interest": 0.78
}, {
"id": "i3",
"interest": 0.11
}]
}"""
val visitsData: Seq[String] = Seq(rec1, rec2)
val productIdToNameMap = Map("i1" -> "Nike Shoes", "i2" -> "Umbrella", "i3" -> "Jeans")
val dictionary = productIdToNameMap.toSeq.toDF("id", "name")
val rddData = spark.sparkContext.parallelize(visitsData)
dictionary.printSchema()
println("for spark version >2.2.0")
var resultDF = spark.read.json(visitsData.toDS)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
println("for spark version <2.2.0")
resultDF = spark.read.json(rddData)
.withColumn("products", explode(col("products")))
.selectExpr("products.*", "visitorId")
.join(dictionary, Seq("id"))
// .withColumn("products", explode(col("products")))
resultDF.show
resultDF.printSchema()
convertJson(resultDF)
/**
* convertJson : converts the data frame to json string
* @param resultDF
*/
private def convertJson(resultDF: DataFrame) = {
import org.apache.spark.sql.functions.{collect_list, _}
val x: DataFrame = resultDF
.groupBy("visitorId")
.agg(collect_list(struct("id", "interest", "name")).as("products"))
x.show
println(x.toJSON.collect.mkString)
}
}
Result :
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
for spark version >2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}
for spark version <2.2.0
+---+--------+---------+----------+
| id|interest|visitorId| name|
+---+--------+---------+----------+
| i1| 0.68| v1|Nike Shoes|
| i2| 0.42| v1| Umbrella|
| i1| 0.78| v2|Nike Shoes|
| i3| 0.11| v2| Jeans|
+---+--------+---------+----------+
root
|-- id: string (nullable = true)
|-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
|-- name: string (nullable = true)
+---------+--------------------+
|visitorId| products|
+---------+--------------------+
| v2|[[i1, 0.78, Nike ...|
| v1|[[i1, 0.68, Nike ...|
+---------+--------------------+
{"visitorId":"v2","products":[{"id":"i1","interest":0.78,"name":"Nike Shoes"},{"id":"i3","interest":0.11,"name":"Jeans"}]}{"visitorId":"v1","products":[{"id":"i1","interest":0.68,"name":"Nike Shoes"},{"id":"i2","interest":0.42,"name":"Umbrella"}]}
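Since the task only asks for a sequence of enriched JSON strings, the same enrichment can also be sketched without Spark. The plain-Python version below is a minimal illustration, assuming the records fit in memory; enrich and product_names are hypothetical names, and an unknown product id would simply get a null name.

```python
import json

product_names = {"i1": "Nike Shoes", "i2": "Umbrella", "i3": "Jeans"}

def enrich(record_json, names):
    """Parse one visit record, add the product name to every product, re-serialise."""
    record = json.loads(record_json)
    record["products"] = [
        # Rebuild each product so "name" sits between "id" and "interest".
        {"id": p["id"], "name": names.get(p["id"]), "interest": p["interest"]}
        for p in record["products"]
    ]
    return json.dumps(record)

rec1 = '{"visitorId": "v1", "products": [{"id": "i1", "interest": 0.68}, {"id": "i2", "interest": 0.42}]}'
rec2 = '{"visitorId": "v2", "products": [{"id": "i1", "interest": 0.78}, {"id": "i3", "interest": 0.11}]}'

output = [enrich(r, product_names) for r in (rec1, rec2)]
for line in output:
    print(line)
```

The Spark join above is the better fit once the visit data no longer fits on one machine; this sketch only shows the per-record logic.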
Example of a method that parses JSON with Scala and returns the result as case classes:
/** ---------------------------------------
*
{
"fields": [
{
"field1": "value",
"field2": [
{
"field21": "value",
"field22": "value"
},
{
"field21": "value",
"field22": "value"
}
]
}
]
}*/
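import scala.io.Source
import scala.util.parsing.json.JSON // legacy parser; newer Scala versions may need the scala-parser-combinators 1.x dependency
case class elementClass(element1 : String, element2 : String)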
case class outputDataClass(field1 : String, exampleClassData : List[elementClass])
def multipleMapJsonParser(jsonDataFile : String) : List[outputDataClass] = {
val JsonData : String = Source.fromFile(jsonDataFile).getLines.mkString
val jsonFormatData = JSON.parseFull(JsonData)
.map{
case json : Map[String, List[Map[String,Any]]] => json("fields").map(
jsonElem =>
outputDataClass(jsonElem("field1").toString,
jsonElem("field2").asInstanceOf[List[Map[String,String]]].map{
case element : Map[String,String] => elementClass(element("field21"),element("field22"))
})
)
}.get
jsonFormatData
}
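The same shape of parser, sketched in plain Python with dataclasses for readers who want to try the idea quickly. It follows the structure in the comment above (fields, field1, field2, field21, field22); Element, OutputData and parse_fields are hypothetical names, and the sample values are made up.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Element:
    element1: str
    element2: str

@dataclass
class OutputData:
    field1: str
    elements: List[Element]

def parse_fields(text):
    """Turn the 'fields' array of the document into a list of OutputData."""
    doc = json.loads(text)
    return [
        OutputData(
            field["field1"],
            [Element(e["field21"], e["field22"]) for e in field["field2"]],
        )
        for field in doc["fields"]
    ]

sample = ('{"fields": [{"field1": "value", "field2": ['
          '{"field21": "a", "field22": "b"}, {"field21": "c", "field22": "d"}]}]}')
parsed = parse_fields(sample)
print(parsed)
```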
I have got the json structure as below:
{
"object": "list",
"total": 3,
"data": [
{
"object": "brand",
"id": "15243937043340",
"company": {
"object": "company",
"id": "956936000",
"name": "ABC"
},
"name": "Kindle",
"images": [
"http://www.spacecentrestorage.com/assets/uploads/General/SCS-Slide02-Commercial.jpg"
]
},
{
"object": "brand",
"id": "15243937043340",
"company": {
"object": "company",
"id": "956936000",
"name": "ABC"
},
"name": "Kindle",
"images": [
"http://www.spacecentrestorage.com/assets/uploads/General/SCS-Slide02-Commercial.jpg"
]
},
{
"object": "brand",
"id": "15243937043340",
"company": {
"object": "company",
"id": "956936000",
"name": "ABC"
},
"name": "Kindle",
"images": [
"http://www.spacecentrestorage.com/assets/uploads/General/SCS-Slide02-Commercial.jpg"
]
}
],
"associated": {}
}
And this is my Gson data class mapping :
data class Response (
@SerializedName("object")
val obj: String,
val total: Int,
val data: List<*>,
val associated: Response
)
data class Brand (
@SerializedName("object")
val obj: String,
val id: String,
val name: String,
val images: List<String>,
val company: Company
)
data class Company (
@SerializedName("object")
val obj: String,
val id: String,
val name: String
)
When it comes to extracting the tree as above, I find that the returned data string becomes malformed JSON and Gson throws a MalformedJsonException at $[0].companies.null
I have read about recursive deserialisation functions, but they are not working in my case. I resorted to deserialising as below, using my original method, which causes the errors:
val response = gson.fromJson(queryResult , Response::class.java)
println("result 2 : $response" )
val dataString = response.data.toString()
println("result 3 : $dataString" )
val brands = Gson().fromJson(dataString, Array<Brand>::class.java).toMutableList()
println("result 4 : $brands" )
I would like to ask:
When a JSON component is converted to a string, are all the indents and " symbols supposed to be erased?
To extract all the associated objects of the elements in a list of objects, what precautions do I have to take when deserialising the list with Gson?
If you set the type parameter of the data list in Response to Brand, Gson knows how to deserialise the items of the list.
data class Response (
@SerializedName("object")
val obj: String,
val total: Int,
val data: List<Brand>,
val associated: Response
)
Using this, there is no need to parse the items of the list again, and you can get all brands like this:
val response = Gson().fromJson(queryResult , Response::class.java)
val dataList = response.data
print("brands: " )
dataList.forEach { println(it) }
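On the first question (why the quotes disappear): as far as I can tell, the elements of an untyped list are deserialised into generic map objects, and their string form is a debug rendering, not JSON. The plain-Python sketch below is only an analogy for that effect, not Gson's behaviour itself: str() of a parsed structure cannot be parsed back, whereas a real JSON serialiser round-trips cleanly. The trimmed-down response is made up from the sample above.

```python
import json

response = json.loads(
    '{"object": "list", "total": 1, '
    '"data": [{"object": "brand", "id": "15243937043340", "name": "Kindle"}]}'
)

as_text = str(response["data"])          # debug representation, not JSON
as_json = json.dumps(response["data"])   # real JSON text

try:
    json.loads(as_text)
    reparsed_ok = True
except json.JSONDecodeError:
    reparsed_ok = False

print(reparsed_ok)                                # False
print(json.loads(as_json) == response["data"])    # True
```

So the precaution is to never feed a toString() result back into the parser: either give the list a concrete element type, as above, or re-serialise with the JSON library before parsing again.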
Being a noob in Scala / Spark, am a bit stuck and would appreciate any help!
Am importing JSON data into a Spark Data Frame. In the process, I end up getting a Data frame having the same nested structure present in the JSON input.
My aim is to flatten the entire Data Frame recursively (including the inner most child attributes in an array / dictionary), using Scala.
Additionally, there may be children attributes which have the same names. Hence, need to differentiate them as well.
A somewhat similar solution (same child attributes for different parents) is shown here - https://stackoverflow.com/a/38460312/3228300
An example of what I am hoping to achieve is as follows:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
The corresponding flattened output Spark DF structure would be:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters_batter_id_0": "1001",
"batters_batter_type_0": "Regular",
"batters_batter_id_1": "1002",
"batters_batter_type_1": "Chocolate",
"batters_batter_id_2": "1003",
"batters_batter_type_2": "Blueberry",
"batters_batter_id_3": "1004",
"batters_batter_type_3": "Devil's Food",
"topping_id_0": "5001",
"topping_type_0": "None",
"topping_id_1": "5002",
"topping_type_1": "Glazed",
"topping_id_2": "5005",
"topping_type_2": "Sugar",
"topping_id_3": "5007",
"topping_type_3": "Powdered Sugar",
"topping_id_4": "5006",
"topping_type_4": "Chocolate with Sprinkles",
"topping_id_5": "5003",
"topping_type_5": "Chocolate",
"topping_id_6": "5004",
"topping_type_6": "Maple"
}
Not having worked much with Scala and Spark previously, am unsure how to proceed.
Lastly, would be extremely thankful if someone can please help with the code for a general / non-schema solution as I need to be applying it to a lot of different collections.
Thanks a lot :)
Here is one possibility, the way we approached it in one of our projects:
1. Define a case class that maps a row from the DataFrame (type is a reserved word in Scala, so it needs backticks):
case class BattersTopics(id: String, `type`: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)
2. Map each row of the DataFrame to the case class:
val dataSet = df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
batters_batter_id_0 = row.getAs[String]("batters_batter_id_0"), ...))
3. Collect to a list and build a Map[String, Any] from each row:
val rows = dataSet.collect().toList
rows.map(bt => Map(
"id" -> bt.id,
"type" -> bt.`type`,
"batters" -> Map(
"batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" -> bt.batters_batter_type_0), ....) // same for the other ids and types
),
"topping" -> List(Map("id" -> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the other ids and types
))
4. Use Jackson to convert the Map[String, Any] to JSON.
Sample data, which contains all the different types of JSON element (nested JSON map, JSON array, long, string, etc.):
{"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- phones: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- models: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- name: string (nullable = true)
|-- watches: struct (nullable = true)
| |-- models: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- name: string (nullable = true)
This sample data has ArrayType and StructType (map-like) values in the JSON data.
We can write a switch condition for each of these two types and repeat the process until the data flattens out to the required DataFrame.
The Spark Java API solution is described here:
https://medium.com/@ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b
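To make the "one case per type, repeat until flat" idea concrete, here is a minimal schema-free sketch in plain Python (outside Spark, standard library only). It follows the naming used in the question, where parent names are joined with underscores and array indices are appended after the leaf name (batters_batter_id_0); flatten is a hypothetical helper, and the donut record is shortened.

```python
import json

def flatten(value, prefix="", suffix=""):
    """Recursively flatten nested dicts and lists into a single-level dict."""
    flat = {}
    if isinstance(value, dict):
        # Struct case: extend the name with each child key.
        for key, child in value.items():
            name = f"{prefix}_{key}" if prefix else key
            flat.update(flatten(child, name, suffix))
    elif isinstance(value, list):
        # Array case: remember the element index, appended after the leaf name.
        for index, child in enumerate(value):
            flat.update(flatten(child, prefix, f"{suffix}_{index}"))
    else:
        # Leaf case: emit the accumulated column name.
        flat[prefix + suffix] = value
    return flat

donut = json.loads("""{
  "id": "0001", "type": "donut", "ppu": 0.55,
  "batters": {"batter": [{"id": "1001", "type": "Regular"},
                         {"id": "1002", "type": "Chocolate"}]},
  "topping": [{"id": "5001", "type": "None"}]
}""")

flat_donut = flatten(donut)
print(json.dumps(flat_donut, indent=2))
```

Because parent names are part of every key, children with the same name under different parents (id under batter and under topping) stay distinct. Records with arrays of different lengths will produce different sets of columns, which is worth keeping in mind before applying this to many collections.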