I have a Spark job which has a DataFrame with the following values:
{
  "id": "abchchd",
  "test_id": "ndsbsb",
  "props": {
    "type": {
      "isMale": true,
      "id": "dd",
      "mcc": 1234,
      "name": "Adam"
    }
  }
}
{
  "id": "abc",
  "test_id": "asf",
  "props": {
    "type2": {
      "isMale": true,
      "id": "dd",
      "mcc": 12134,
      "name": "Perth"
    }
  }
}
and I want to flatten it out elegantly (the number of keys and their names, such as type, are unknown) in such a way that props remains a struct but everything inside it is flattened, irrespective of the level of nesting.
The desired output is:
{
  "id": "abchchd",
  "test_id": "ndsbsb",
  "props": {
    "type.isMale": true,
    "type.id": "dd",
    "type.mcc": 1234,
    "type.name": "Adam"
  }
}
{
  "id": "abc",
  "test_id": "asf",
  "props": {
    "type2.isMale": true,
    "type2.id": "dd",
    "type2.mcc": 12134,
    "type2.name": "Perth"
  }
}
I used the solution mentioned in
Automatically and Elegantly flatten DataFrame in Spark SQL
however, I'm unable to keep the props field intact; it also gets flattened away.
Can somebody help me extend this solution?
The final schema should be something like:
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
| |-- type2.id: string (nullable = true)
| |-- type2.isMale: boolean (nullable = true)
| |-- type2.mcc: long (nullable = true)
| |-- type2.name: string (nullable = true)
|-- test_id: string (nullable = true)
I've been able to achieve this with the RDD API:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StructType}
import spark.implicits._

val jsonRDD = df.rdd.map { row =>
  def unnest(r: Row): Map[String, Any] = {
    r.schema.fields.zipWithIndex.flatMap { case (f, i) =>
      (f.name, f.dataType) match {
        case ("props", _: StructType) =>
          val propsObject = r.getAs[Row](f.name)
          Map(f.name -> propsObject.schema.fields.flatMap { propsAttr =>
            val subObject = propsObject.getAs[Row](propsAttr.name)
            subObject.schema.fields.map { subField =>
              s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
            }
          }.toMap)
        case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
        case (fname, ArrayType(_: StructType, _)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
        case _ => Map(f.name -> r.get(i))
      }
    }.toMap
  }
  val asMap = unnest(row)
  new ObjectMapper().registerModule(DefaultScalaModule).writeValueAsString(asMap)
}
val finalDF = spark.read.json(jsonRDD.toDS).cache
The solution should accept deeply nested inputs, thanks to recursion.
With your data, here's what we get:
finalDF.printSchema()
finalDF.show(false)
finalDF.select("props.*").show()
Outputs:
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
|-- test_id: string (nullable = true)
+-------+----------------------+-------+
|id |props |test_id|
+-------+----------------------+-------+
|abchchd|[dd, true, 1234, Adam]|ndsbsb |
+-------+----------------------+-------+
+-------+-----------+--------+---------+
|type.id|type.isMale|type.mcc|type.name|
+-------+-----------+--------+---------+
| dd| true| 1234| Adam|
+-------+-----------+--------+---------+
But we can also pass more nested/complex structures, for instance:
val str2 = """{"newroot":[{"mystruct":{"id":"abchchd","test_id":"ndsbsb","props":{"type":{"isMale":true,"id":"dd","mcc":1234,"name":"Adam"}}}}]}"""
...
finalDF.printSchema()
finalDF.show(false)
This gives the following output:
root
|-- newroot: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- mystruct: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- props: struct (nullable = true)
| | | | |-- type.id: string (nullable = true)
| | | | |-- type.isMale: boolean (nullable = true)
| | | | |-- type.mcc: long (nullable = true)
| | | | |-- type.name: string (nullable = true)
| | | |-- test_id: string (nullable = true)
+---------------------------------------------+
|root |
+---------------------------------------------+
|[[[abchchd, [dd, true, 1234, Adam], ndsbsb]]]|
+---------------------------------------------+
EDIT: As you mentioned, if you have records with different structures you need to wrap the above subObject value in an Option.
Here's the fixed unnest function:
def unnest(r: Row): Map[String, Any] = {
  r.schema.fields.zipWithIndex.flatMap { case (f, i) =>
    (f.name, f.dataType) match {
      case ("props", _: StructType) =>
        val propsObject = r.getAs[Row](f.name)
        Map(f.name -> propsObject.schema.fields.flatMap { propsAttr =>
          val subObjectOpt = Option(propsObject.getAs[Row](propsAttr.name))
          subObjectOpt.toSeq.flatMap { subObject =>
            subObject.schema.fields.map { subField =>
              s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
            }
          }
        }.toMap)
      case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
      case (fname, ArrayType(_: StructType, _)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
      case _ => Map(f.name -> r.get(i))
    }
  }.toMap
}
The new printSchema gives:
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
| |-- type2.id: string (nullable = true)
| |-- type2.isMale: boolean (nullable = true)
| |-- type2.mcc: long (nullable = true)
| |-- type2.name: string (nullable = true)
|-- test_id: string (nullable = true)
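For reference, here's an alternative sketch that stays in the DataFrame API by generating column expressions from the schema. It is not part of the answer above; it only handles structs nested inside props (no arrays), and with records of different shapes it produces the union of all flattened fields, with nulls where a key is absent:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Recursively collect the dotted path of every leaf field under a struct.
def leafPaths(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => leafPaths(st, path)
      case _ => Seq(path)
    }
  }

// Rebuild props as a single-level struct whose field names are the dotted paths.
def flattenProps(df: DataFrame): DataFrame = {
  val propsType = df.schema("props").dataType.asInstanceOf[StructType]
  val flattened = leafPaths(propsType).map(p => col(s"props.$p").as(p))
  df.withColumn("props", struct(flattened: _*))
}

Calling flattenProps on the DataFrame read from the original JSON should give the same props layout as the printSchema above.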
Related
I have a metadata file with a column with information on the schema of a file:
[{"column_datatype": "varchar", "column_description": "Indicates whether the Customer belongs to a particular business size, business activity, retail segment, demography, or other group and is used for reporting on regio performance regio migration.", "column_length": "4", "column_name": "clnt_grp_cd", "column_personally_identifiable_information": "False", "column_precision": "4", "column_primary_key": "True", "column_scale": null, "column_security_classifications": [], "column_sequence_number": "1"}
root
|-- column_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column_datatype: string (nullable = true)
| | |-- column_description: string (nullable = true)
| | |-- column_length: string (nullable = true)
| | |-- column_name: string (nullable = true)
| | |-- column_personally_identifiable_information: string (nullable = true)
| | |-- column_precision: string (nullable = true)
| | |-- column_primary_key: string (nullable = true)
| | |-- column_scale: string (nullable = true)
| | |-- column_security_classifications: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- column_sequence_number: string (nullable = true)
I want to read a df using this schema. Something like:
schema = StructType([ \
StructField("clnt_grp_cd",StringType(),True),\
StructField("clnt_grp_lvl1_nm",StringType(),True),\
(...)
])
df = spark.read.schema(schema).format("csv").option("header","true").load(filenamepath)
Is there a built-in method to parse this as a schema?
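As far as I know there is no single built-in that consumes this exact metadata layout, but a schema can be assembled from it. A rough sketch in Scala (to match the rest of the thread; the type mapping and the metadata path are assumptions, not part of the question):

import org.apache.spark.sql.types._

// Hypothetical mapping from the metadata's column_datatype strings to Spark types;
// extend the cases to whatever datatypes actually occur in the metadata file.
def toSparkType(dt: String): DataType = dt.toLowerCase match {
  case "varchar" | "char" | "string" => StringType
  case "int" | "integer" => IntegerType
  case "bigint" => LongType
  case "double" => DoubleType
  case _ => StringType
}

// Read the metadata, flatten column_info, and build a StructType ordered by sequence number.
val columnInfo = spark.read.json("metadata.json")
  .selectExpr("explode(column_info) as c")
  .select("c.column_name", "c.column_datatype", "c.column_sequence_number")
  .collect()
  .sortBy(_.getAs[String]("column_sequence_number").toInt)

val schema = StructType(columnInfo.map { r =>
  StructField(r.getAs[String]("column_name"), toSparkType(r.getAs[String]("column_datatype")), nullable = true)
})

val df = spark.read.schema(schema).option("header", "true").csv(filenamepath)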
I am trying to pull data out of a DataFrame as shown below. The JSON data, which has nested structures, is entirely in one column (_c1). I want to pull it out and create a separate DataFrame with valid column names. One sample record is below.
|_c1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}}
I am reading it with a schema as follows:
val schema=StructType(Array(
StructField("Id", StringType, false),
StructField("Type", StringType, false),
StructField("client", StringType, false),
StructField("eventTime", StringType, false),
StructField("eventType", StringType, false),
StructField("payload", ArrayType(StructType(Array(
StructField("sourceApp", StringType, false),
StructField("questionnaire", ArrayType(StructType(Array(
StructField("version", StringType, false),
StructField("question", StringType, false),
StructField("fb", StringType, false)))))
))))
))
val json_paral = DF.select(from_json(col("_c1"),schema))
The structure comes out as below:
|-- jsontostructs(_c1): struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- sourceApp: string (nullable = true)
| | | |-- questionnaire: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- version: string (nullable = true)
| | | | | |-- question: string (nullable = true)
| | | | | |-- fb: string (nullable = true)
The structure is good, but when I check the DataFrame all the data comes out as NULL. Is the read fine? I'm not getting any parsing issues either.
Please check if this helps-
1. Load the data
val data = """{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} """
val df = Seq(data).toDF("jsonCol")
df.show(false)
df.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|jsonCol |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- jsonCol: string (nullable = true)
2. Extract the JSON string into separate fields
df.select(json_tuple(col("jsonCol"), "Id", "Type", "client", "eventTime", "eventType", "payload"))
.show(false)
Output-
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|c0 |c1 |c2 |c3 |c4 |c5 |
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|31279605299|12121212|Checklist _API|2020-03-17T15:50:30.640Z|Event|{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}|
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
3. Using from_json(..) with a schema that matches the actual structure (payload and questionnaire are structs, not arrays, which is why the schema in the question yields NULLs)
val processed = df.select(
expr("from_json(jsonCol, 'struct<Id:string,Type:string,client:string,eventTime:string, eventType:string," +
"payload:struct<questionnaire:struct<fb:string,question:string,version:string>,sourceApp:string>>')")
.as("json_converted"))
processed.show(false)
processed.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------+
|json_converted |
+-------------------------------------------------------------------------------------------------------------+
|[31279605299, 12121212, Checklist _API, 2020-03-17T15:50:30.640Z, Event, [[Na, How to resolve ? , 1.0], ios]]|
+-------------------------------------------------------------------------------------------------------------+
root
|-- json_converted: struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: struct (nullable = true)
| | |-- questionnaire: struct (nullable = true)
| | | |-- fb: string (nullable = true)
| | | |-- question: string (nullable = true)
| | | |-- version: string (nullable = true)
| | |-- sourceApp: string (nullable = true)
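To get back to the flat columns the question asks for, a follow-up select could look like this (a sketch; the names are taken from the schema above):

import org.apache.spark.sql.functions.col

val flat = processed.select(
  col("json_converted.Id"),
  col("json_converted.client"),
  col("json_converted.payload.sourceApp").as("sourceApp"),
  col("json_converted.payload.questionnaire.question").as("question")
)
flat.show(false)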
Instead of reading it with a schema, I tried converting it to a string value:
val Df = json_DF.map(r => r.getString(0))
This pulls the data out as a string, on which the line below builds a DataFrame with the JSON keys as column names:
val g1DF=spark.read.json(Df)
I then did some lateral view explode on the nested fields to pull out the nested values.
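For completeness, a minimal sketch of that last step, with the field names assumed from the sample record (for this particular record the nested values come back as structs, so plain column paths are enough; explode would only be needed where real arrays show up):

import org.apache.spark.sql.functions.col

// Pull the nested payload/questionnaire fields up to top-level columns.
val flatDF = g1DF.select(
  col("Id"), col("Type"), col("client"), col("eventTime"), col("eventType"),
  col("payload.sourceApp").as("sourceApp"),
  col("payload.questionnaire.version").as("version"),
  col("payload.questionnaire.question").as("question"),
  col("payload.questionnaire.fb").as("fb")
)
flatDF.show(false)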
I have a DataFrame with one column of type String.
Sample string: the content below is stored entirely as a single-line string inside the DataFrame.
[{"id":"2ac3a7d6-46d8-4903-8a92-000000000003","type":"Stemming","name":"Drill Cuttings","dateCreated":"2019-08-23T05:22:51.7050657Z","dateModified":"2019-08-23T05:22:51.7050657Z","dateDeleted":null,"isDeleted":false,"abbreviation":null,"supplierName":null,"shotPlusReference":null,"restrictedTo":[],"metadata":{},"documentLinks":[]},{"EType":"ProductSave"}]
I have tried the following: convert my DataFrame to an RDD and read the RDD using spark.read.json.
val newRDD = MyDF.rdd.map(_.getString(0))
val ExpectedDF1 = spark.read.json(newRDD.toString())
ExpectedDF1.printSchema()
When printing the schema, the {"EType":"ProductSave"} field is missing, whereas the other fields come through fine.
1) val newRDD = MyDF.rdd.map(_.getString(0))
val ExpectedDF1 = spark.read.json(newRDD.toString())
ExpectedDF1.printSchema()
2) val newRDD = MyDF.rdd.map(row=> (row.getAs[Json](1).toString))
val tempDF= spark.read.json(newRDD)
tempDF.printSchema()
3)
val JsonDF = MyDF.toJSON.toDF("value")
val convertedRDD = spark.read.json(JsonDF.rdd)
convertedRDD.printSchema()
All three of these approaches leave the last field out of the schema:
{"EType":"ProductSave"}
Actual schema returned
root
|-- abbreviation: string (nullable = true)
|-- dateCreated: string (nullable = true)
|-- dateDeleted: string (nullable = true)
|-- dateModified: string (nullable = true)
|-- density: double (nullable = true)
|-- displayColor: string (nullable = true)
|-- documentLinks: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- isDeleted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dateModified: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isDeleted: boolean (nullable = true)
| | |-- variants: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- isDeleted: boolean (nullable = true)
|-- restrictedTo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- shotPlusReference: string (nullable = true)
|-- supplierName: string (nullable = true)
|-- type: string (nullable = true)
Expected Schema
root
|-- EType: string (nullable = true)
|-- abbreviation: string (nullable = true)
|-- dateCreated: string (nullable = true)
|-- dateDeleted: string (nullable = true)
|-- dateModified: string (nullable = true)
|-- density: double (nullable = true)
|-- displayColor: string (nullable = true)
|-- documentLinks: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- isDeleted: boolean (nullable = true)
|-- name: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dateModified: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isDeleted: boolean (nullable = true)
| | |-- variants: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- isDeleted: boolean (nullable = true)
|-- restrictedTo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- shotPlusReference: string (nullable = true)
|-- supplierName: string (nullable = true)
|-- type: string (nullable = true)
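A minimal sketch of the usual idiom for re-parsing a JSON string column without going through toString (assuming Spark 2.2+, where spark.read.json accepts a Dataset[String]); whether the missing EType key reappears depends on the actual data, but in principle the inferred schema is the union of the keys found across all array elements:

import spark.implicits._

// One JSON array per input row; spark.read.json emits one output row per array element.
val jsonDS = MyDF.map(_.getString(0))
val parsed = spark.read.json(jsonDS)
parsed.printSchema()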
I have this kind of JSON data:
{
  "data": [
    {
      "id": "4619623",
      "team": "452144",
      "created_on": "2018-10-09 02:55:51",
      "links": {
        "edit": "https://some_page",
        "publish": "https://some_publish",
        "default": "https://some_default"
      }
    },
    {
      "id": "4619600",
      "team": "452144",
      "created_on": "2018-10-09 02:42:25",
      "links": {
        "edit": "https://some_page",
        "publish": "https://some_publish",
        "default": "https://some_default"
      }
    }
  ]
}
I read this data using Apache Spark and I want to write it out partitioned by the id column. When I use this:
df.write.partitionBy("data.id").json(<path_to_folder>)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Partition column data.id not found in schema
I also tried to use the explode function, like this:
import org.apache.spark.sql.functions.{col, explode}
val renamedDf= df.withColumn("id", explode(col("data.id")))
renamedDf.write.partitionBy("id").json(<path_to_folder>)
That actually helped, but each id partition folder contained the same original JSON file.
EDIT: schema of df DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
Schema of renamedDf DataFrame:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
|-- id: string (nullable = true)
I am using Spark 2.1.0.
I found this solution: DataFrame partitionBy on nested columns
And this example: http://bigdatums.net/2016/02/12/how-to-extract-nested-json-data-in-spark/
But none of this helped me to solve my problem.
Thanks in advance for any help.
Try the following code:
val renamedDf = df
.select(explode(col("data")) as "x" )
.select($"x.*")
renamedDf.write.partitionBy("id").json(<path_to_folder>)
You are just missing a select statement after the initial explode.
val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("/FileStore/tables/test.json")
df.printSchema
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_on: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- links: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- edit: string (nullable = true)
| | | |-- publish: string (nullable = true)
| | |-- team: string (nullable = true)
import org.apache.spark.sql.functions.{col, explode}
val df1= df.withColumn("data", explode(col("data")))
df1.printSchema
root
|-- data: struct (nullable = true)
| |-- created_on: string (nullable = true)
| |-- id: string (nullable = true)
| |-- links: struct (nullable = true)
| | |-- default: string (nullable = true)
| | |-- edit: string (nullable = true)
| | |-- publish: string (nullable = true)
| |-- team: string (nullable = true)
val df2 = df1.select("data.created_on","data.id","data.team","data.links")
df2.show
+-------------------+-------+------+--------------------+
| created_on| id| team| links|
+-------------------+-------+------+--------------------+
|2018-10-09 02:55:51|4619623|452144|[https://some_def...|
|2018-10-09 02:42:25|4619600|452144|[https://some_def...|
+-------------------+-------+------+--------------------+
df2.write.partitionBy("id").json("FileStore/tables/test_part.json")
val f = spark.read.json("/FileStore/tables/test_part.json/id=4619600")
f.show
+-------------------+--------------------+------+
| created_on| links| team|
+-------------------+--------------------+------+
|2018-10-09 02:42:25|[https://some_def...|452144|
+-------------------+--------------------+------+
val full = spark.read.json("/FileStore/tables/test_part.json")
full.show
+-------------------+--------------------+------+-------+
| created_on| links| team| id|
+-------------------+--------------------+------+-------+
|2018-10-09 02:55:51|[https://some_def...|452144|4619623|
|2018-10-09 02:42:25|[https://some_def...|452144|4619600|
+-------------------+--------------------+------+-------+
I have a JSON file to be parsed. The JSON format is like this:
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every value in the file. How can I get "major" out of the array, and do I have to get "state" using the method df.select("cv_parse.basic_info.location.state")?
This is the result I want:
cv_id  major    degree    birthyear  state
001    English  Bachelor  1984       New York
001    English  Master    1984       New York
This might not be the best way of doing it but you can give it a shot.
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be:
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you can explode the educations column:
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+