I am seeking help in transforming a JSON DataFrame/RDD from one schema structure to another. I am reading more than 100,000 rows of JSON data and would like to transform that data before inserting it into a document store. I am performing the transformations in Spark using Scala. Currently I am using the json4s framework to parse one JSON row at a time and transform it before inserting it into the document store, which does not leverage the power of Spark processing. I would like to transform all the data together, which I believe will speed up the processing.
Following is a code snippet to illustrate my problem.
My input data looks like the inpString string below, with multiple rows of JSON.
val inpString= """[{
"data": {
"id": "1234",
"name": "abc",
"time": "2015-01-01 13-44-21",
"x": [
50,
10
],
"y": [
100,
20
],
"z": [
150,
30
],
"x_limit": [
70,
90,
15,
20
],
"y_limit": [
70,
90,
15,
20
],
"z_limit": [
70,
90,
15,
20
]
}},
{
"data": {
"id": "1235",
"name": "cde",
"time": "2015-01-01 3-21-01",
"x": [
50,
10
],
"y": [
100,
20
],
"z": [
150,
30
],
"x_limit": [
70,
90,
15,
20
],
"y_limit": [
70,
90,
15,
20
],
"z_limit": [
70,
90,
15,
20
]
}}]"""
I read it into a DataFrame and am able to select, group by and do all the other operations using Spark SQL.
val inputRDD = sc.parallelize(inpString::Nil)
val inputDf = sqlContext.read.json(inputRDD)
inputDf.printSchema()
The schema looks like this:
root
|-- data: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- time: string (nullable = true)
| |-- x: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- x_limit: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- y: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- y_limit: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- z: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- z_limit: array (nullable = true)
| | |-- element: long (containsNull = true)
The x array in my input data has two types of readings (say type1 and type2), and the x_limit array holds the lower and upper limits for those two types: x_limit[0] is the lower limit for type1, x_limit[1] is the upper limit for type1, x_limit[2] is the lower limit for type2 and x_limit[3] is the upper limit for type2. I need to group all the data for type1 into one struct and all the data for type2 into another struct.
The following code snippet illustrates the output schema I want:
val outString= """[{
"data": {
"id": "1234",
"name": "abc",
"time": "2015-01-01 13-44-21",
"type1": {
"x_axis" : 50,
"y_axis" : 100,
"z_axis" : 150,
"x_lower_limit" : 70,
"x_upper_limit" : 90,
"y_lower_limit" : 70,
"y_upper_limit" : 90,
"z_lower_limit" : 70,
"z_upper_limit" : 90
},
"type2": {
"x_axis" : 10,
"y_axis" : 20,
"z_axis" : 30,
"x_lower_limit" : 15,
"x_upper_limit" : 20,
"y_lower_limit" : 15,
"y_upper_limit" : 20,
"z_lower_limit" : 15,
"z_upper_limit" : 20
}
}
},
{
"data": {
"id": "1235",
"name": "cde",
"time": "2015-01-01 3-21-01",
"type1": {
"x_axis" : 50,
"y_axis" : 100,
"z_axis" : 150,
"x_lower_limit" : 70,
"x_upper_limit" : 90,
"y_lower_limit" : 70,
"y_upper_limit" : 90,
"z_lower_limit" : 70,
"z_upper_limit" : 90
},
"type2": {
"x_axis" : 10,
"y_axis" : 20,
"z_axis" : 30,
"x_lower_limit" : 15,
"x_upper_limit" : 20,
"y_lower_limit" : 15,
"y_upper_limit" : 20,
"z_lower_limit" : 15,
"z_upper_limit" : 20
}
}
}
]"""
val outputRDD = sc.parallelize(outString::Nil)
val outputDf = sqlContext.read.json(outputRDD)
outputDf.printSchema()
Output Schema
root
|-- data: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- time: string (nullable = true)
| |-- type1: struct (nullable = true)
| | |-- x_axis: long (nullable = true)
| | |-- x_lower_limit: long (nullable = true)
| | |-- x_upper_limit: long (nullable = true)
| | |-- y_axis: long (nullable = true)
| | |-- y_lower_limit: long (nullable = true)
| | |-- y_upper_limit: long (nullable = true)
| | |-- z_axis: long (nullable = true)
| | |-- z_lower_limit: long (nullable = true)
| | |-- z_upper_limit: long (nullable = true)
| |-- type2: struct (nullable = true)
| | |-- x_axis: long (nullable = true)
| | |-- x_lower_limit: long (nullable = true)
| | |-- x_upper_limit: long (nullable = true)
| | |-- y_axis: long (nullable = true)
| | |-- y_lower_limit: long (nullable = true)
| | |-- y_upper_limit: long (nullable = true)
| | |-- z_axis: long (nullable = true)
| | |-- z_lower_limit: long (nullable = true)
| | |-- z_upper_limit: long (nullable = true)
I did some research to find a similar scenario but could not find one. I appreciate your input.
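For reference, here is a minimal sketch, assuming the inputDf above, of how this regrouping could be expressed with Spark SQL column functions (struct plus array indexing) instead of parsing one row at a time with json4s. The column names follow the input and output schemas shown above:
import org.apache.spark.sql.functions.{col, struct}

// Index 0 of x/y/z holds the type1 reading, index 1 the type2 reading;
// the limit arrays are laid out as [type1 lower, type1 upper, type2 lower, type2 upper].
val transformedDf = inputDf.select(
  struct(
    col("data.id").as("id"),
    col("data.name").as("name"),
    col("data.time").as("time"),
    struct(
      col("data.x")(0).as("x_axis"),
      col("data.y")(0).as("y_axis"),
      col("data.z")(0).as("z_axis"),
      col("data.x_limit")(0).as("x_lower_limit"),
      col("data.x_limit")(1).as("x_upper_limit"),
      col("data.y_limit")(0).as("y_lower_limit"),
      col("data.y_limit")(1).as("y_upper_limit"),
      col("data.z_limit")(0).as("z_lower_limit"),
      col("data.z_limit")(1).as("z_upper_limit")
    ).as("type1"),
    struct(
      col("data.x")(1).as("x_axis"),
      col("data.y")(1).as("y_axis"),
      col("data.z")(1).as("z_axis"),
      col("data.x_limit")(2).as("x_lower_limit"),
      col("data.x_limit")(3).as("x_upper_limit"),
      col("data.y_limit")(2).as("y_lower_limit"),
      col("data.y_limit")(3).as("y_upper_limit"),
      col("data.z_limit")(2).as("z_lower_limit"),
      col("data.z_limit")(3).as("z_upper_limit")
    ).as("type2")
  ).as("data")
)

transformedDf.printSchema()
Because this stays inside the DataFrame API, the transformation runs on all rows in parallel; the result can then be serialized (for example with transformedDf.toJSON) before being written to the document store.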
Related
This is my goal:
I am trying to analyze the JSON files created by Microsoft's Azure Data Factory. I want to convert them into a set of relational tables.
To explain my problem, I have tried to create a sample with reduced complexity.
You can produce two sample records with the Python code below:
sample1 = """{
"name": "Pipeline1",
"properties": {
"parameters": {
"a": {"type": "string", "default": ""},
"b": {"type": "string", "default": "chris"},
"c": {"type": "string", "default": "columbus"},
"d": {"type": "integer", "default": "0"}
},
"annotations": ["Test","Sample"]
}
}"""
sample2 = """{
"name": "Pipeline2",
"properties": {
"parameters": {
"x": {"type": "string", "default": "X"},
"y": {"type": "string", "default": "Y"},
},
"annotations": ["another sample"]
}
My first approach to loading this data is, of course, to read it as JSON structures:
df = spark.read.json(sc.parallelize([sample1,sample2]))
df.printSchema()
df.show()
but this returns:
root
|-- _corrupt_record: string (nullable = true)
|-- name: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- annotations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- parameters: struct (nullable = true)
| | |-- a: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- b: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- c: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- type: string (nullable = true)
| | |-- d: struct (nullable = true)
| | | |-- default: string (nullable = true)
| | | |-- type: string (nullable = true)
+--------------------+---------+--------------------+
| _corrupt_record| name| properties|
+--------------------+---------+--------------------+
| null|Pipeline1|{[Test, Sample], ...|
|{
"name": "Pipel...|Pipeline2| null|
+--------------------+---------+--------------------+
As you can see, the second sample was not loaded, apparently because the schemas of sample1 and sample2 are different (different names of parameters).
I do not know why Microsoft has decided to make the parameters elements of a struct and not of an array, but I can't change that.
Let me come back to my goal: I would like to create two dataframes out of those samples:
The first dataframe should contain the annotations (with the columns pipeline_name and annotation), the other dataframe should contain the parameters (with the columns pipeline_name, parameter_name, parameter_type and parameter_default).
Does anybody know a simple way to convert the elements of a struct (not an array) into rows of a dataframe?
First of all, I was thinking about a user-defined function which converts the JSON code one record at a time and loops over the elements of the "parameters" structure to return them as elements of an array. But I did not find out exactly how to achieve that. I have tried:
import json
from pyspark.sql import Row
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
# create a dataframe with the json data as strings
df = spark.createDataFrame([Row(json=sample1), Row(json=sample2)])
#define desired schema
new_schema = StructType([
StructField("pipeline", StructType([
StructField("name", StringType(), True)
,StructField("params", ArrayType(StructType([
StructField("paramname", StringType(), True)
,StructField("type", StringType(), True)
,StructField("default", StringType(), True)
])), None)
,StructField("annotations", ArrayType(StringType()), True)
]), True)
])
def parse_pipeline(source:str):
dict = json.loads(source)
name = dict["name"]
props = dict["properties"]
paramlist = [ ( key, value.get('type'), value.get('default')) for key, value in props.get("parameters",{}).items() ]
annotations = props.get("annotations")
return {'pipleine': { 'name':name, 'params':paramlist, 'annotations': annotations}}
parse_pipeline_udf = udf(parse_pipeline, new_schema)
df = df.withColumn("data", parse_pipeline_udf(F.col("json")))
But this returns an error message: Failed to convert the JSON string '{"metadata":{},"name":"params","nullable":null,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"paramname","nullable":true,"type":"string"},{"metadata":{},"name":"type","nullable":true,"type":"string"},{"metadata":{},"name":"default","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}}' to a field.
Maybe the error comes from the return value of my UDF. But if that's the reason, how should I pass the result?
Thank you for any help.
First, I fixed your data sample: the closing """ and } were missing, and there was an extra comma:
sample1 = """{
"name": "Pipeline1",
"properties": {
"parameters": {
"a": {"type": "string", "default": ""},
"b": {"type": "string", "default": "chris"},
"c": {"type": "string", "default": "columbus"},
"d": {"type": "integer", "default": "0"}
},
"annotations": ["Test","Sample"]
}
}"""
sample2 = """{
"name": "Pipeline2",
"properties": {
"parameters": {
"x": {"type": "string", "default": "X"},
"y": {"type": "string", "default": "Y"}
},
"annotations": ["another sample"]
}
}"""
With just this fix, sample2 is already included when you use your basic code.
But if you want "array", actually, you need a map type.
from pyspark.sql import types as T

new_schema = T.StructType([
T.StructField("name", T.StringType()),
T.StructField("properties", T.StructType([
T.StructField("annotations", T.ArrayType(T.StringType())),
T.StructField("parameters", T.MapType(T.StringType(), T.StructType([
T.StructField("default", T.StringType()),
T.StructField("type", T.StringType()),
])))
]))
])
df = spark.read.json(sc.parallelize([sample1, sample2]), new_schema)
and the result :
df.show(truncate=False)
+---------+-----------------------------------------------------------------------------------------------------+
|name |properties |
+---------+-----------------------------------------------------------------------------------------------------+
|Pipeline1|[[Test, Sample], [a -> [, string], b -> [chris, string], c -> [columbus, string], d -> [0, integer]]]|
|Pipeline2|[[another sample], [x -> [X, string], y -> [Y, string]]] |
+---------+-----------------------------------------------------------------------------------------------------+
df.printSchema()
root
|-- name: string (nullable = true)
|-- properties: struct (nullable = true)
| |-- annotations: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- parameters: map (nullable = true)
| | |-- key: string
| | |-- value: struct (valueContainsNull = true)
| | | |-- default: string (nullable = true)
| | | |-- type: string (nullable = true)
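From there, the two relational tables described in the question (annotations and parameters) can be derived by exploding the annotations array and the parameters map. Below is a minimal sketch of that step, assuming a DataFrame df with the schema above; it is written with the Scala API, but explode works the same way in PySpark:
import org.apache.spark.sql.functions.{col, explode}

// One row per (pipeline_name, annotation)
val annotationsDf = df.select(
  col("name").as("pipeline_name"),
  explode(col("properties.annotations")).as("annotation"))

// Exploding a map column yields `key` and `value` columns;
// here the value is the {default, type} struct.
val parametersDf = df
  .select(col("name").as("pipeline_name"), explode(col("properties.parameters")))
  .select(
    col("pipeline_name"),
    col("key").as("parameter_name"),
    col("value.type").as("parameter_type"),
    col("value.default").as("parameter_default"))

annotationsDf.show(false)
parametersDf.show(false)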
I have the following JSON data:
{
"3200": {
"id": "3200",
"value": [
"cat",
"dog"
]
},
"2000": {
"id": "2000",
"value": [
"bird"
]
},
"2500": {
"id": "2500",
"value": [
"kitty"
]
},
"3650": {
"id": "3650",
"value": [
"horse"
]
}
}
The schema of this data, printed with the printSchema utility after we load the data with Spark, is as follows:
root
|-- 3200: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2000: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2500: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 3650: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
and I want to get the following dataframe
id value
3200 cat
2000 bird
2500 kitty
3200 dog
3650 horse
How can I do the parsing to get this expected output?
Using spark-sql
Dataframe step (same as in Mohana's answer)
val df = spark.read.json(Seq(jsonData).toDS())
Build a temp view
df.createOrReplaceTempView("df")
Build the key and value expressions from the column names, then run the query. Result:
val cols_k = df.columns.map( x => s"`${x}`.id" ).mkString(",")
val cols_v = df.columns.map( x => s"`${x}`.value" ).mkString(",")
spark.sql(s"""
with t1 ( select map_from_arrays(array(${cols_k}),array(${cols_v})) s from df ),
t2 ( select explode(s) (key,value) from t1 )
select key, explode(value) value from t2
""").show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
You can use the stack() function to transpose the dataframe, then extract the key field and explode the value field using the explode_outer function.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val jsonData = """{
| "3200": {
| "id": "3200",
| "value": [
| "cat",
| "dog"
| ]
| },
| "2000": {
| "id": "2000",
| "value": [
| "bird"
| ]
| },
| "2500": {
| "id": "2500",
| "value": [
| "kitty"
| ]
| },
| "3650": {
| "id": "3650",
| "value": [
| "horse"
| ]
| }
|}
|""".stripMargin
val df = spark.read.json(Seq(jsonData).toDS())
df.selectExpr("stack (4, *) key")
.select(expr("key.id").as("key"),
explode_outer(expr("key.value")).as("value"))
.show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
I'm trying to create a dataframe from a JSON file with nested fields and date fields that I'd like to concatenate:
root
|-- MODEL: string (nullable = true)
|-- CODE: string (nullable = true)
|-- START_Time: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- WEIGHT: string (nullable = true)
|-- REGISTED: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- TOTAL: string (nullable = true)
|-- SCHEDULED: struct (nullable = true)
| |-- day: long (nullable = true)
| |-- hour: long (nullable = true)
| |-- minute: long (nullable = true)
| |-- month: long (nullable = true)
| |-- second: long (nullable = true)
| |-- year: long (nullable = true)
|-- PACKAGE: string (nullable = true)
The objective is to get a result more like:
+---------+------------------+----------+-----------------+---------+-----------------+
|MODEL | START_Time | WEIGHT |REGISTED |TOTAL |SCHEDULED |
+---------+------------------+----------+-----------------+---------+-----------------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT |yy-mm-dd-hh-mm-ss|TOTAL |yy-mm-dd-hh-mm-ss|
where yy-mm-dd-hh-mm-ss is the concatenation of the day, hour, minute, etc. fields in the JSON:
|-- example: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
I have tried the explode function (maybe I didn't use it as I should), but it didn't work.
Can anyone inspire me with a solution?
Thank you.
You can do it in the simple steps below.
Let's say we have the data below in the data.json file:
{"MODEL": "abc", "CODE": "CODE1", "START_Time": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "WEIGHT": "231", "REGISTED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "TOTAL": "1", "SCHEDULED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"},"PACKAGE": "CAR"}
This data has the same schema as you shared.
Read this JSON file in PySpark as below:
from pyspark.sql.functions import *
df = spark.read.json('data.json')
Now you can read the nested values and modify the column values as below.
df.withColumn('START_Time', concat(col('START_Time.year'), lit('-'), col('START_Time.month'), lit('-'),
                                   col('START_Time.day'), lit('-'), col('START_Time.hour'), lit('-'),
                                   col('START_Time.minute'), lit('-'), col('START_Time.second'))) \
  .withColumn('REGISTED', concat(col('REGISTED.year'), lit('-'), col('REGISTED.month'), lit('-'),
                                 col('REGISTED.day'), lit('-'), col('REGISTED.hour'), lit('-'),
                                 col('REGISTED.minute'), lit('-'), col('REGISTED.second'))) \
  .withColumn('SCHEDULED', concat(col('SCHEDULED.year'), lit('-'), col('SCHEDULED.month'), lit('-'),
                                  col('SCHEDULED.day'), lit('-'), col('SCHEDULED.hour'), lit('-'),
                                  col('SCHEDULED.minute'), lit('-'), col('SCHEDULED.second'))) \
  .show()
The output would be
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
|CODE |MODEL|PACKAGE|REGISTED         |SCHEDULED        |START_Time       |TOTAL|WEIGHT|
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
|CODE1|abc  |CAR    |21-08-05-08-30-30|21-08-05-08-30-30|21-08-05-08-30-30|1    |231   |
+-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
I have a Spark job which has a DataFrame with the following value:
{
"id": "abchchd",
"test_id": "ndsbsb",
"props": {
"type": {
"isMale": true,
"id": "dd",
"mcc": 1234,
"name": "Adam"
}
}
}
{
"id": "abc",
"test_id": "asf",
"props": {
"type2": {
"isMale": true,
"id": "dd",
"mcc": 12134,
"name": "Perth"
}
}
}
and I want to flatten it out elegantly (since the number of keys and their names, e.g. type, are unknown) in such a way that props remains a struct but everything inside it is flattened, irrespective of the level of nesting.
The desired output is:
{
"id": "abchchd",
"test_id": "ndsbsb",
"props": {
"type.isMale": true,
"type.id": "dd",
"type.mcc": 1234,
"type.name": "Adam"
}
}
{
"id": "abc",
"test_id": "asf",
"props": {
"type2.isMale": true,
"type2.id": "dd",
"type2.mcc": 12134,
"type2.name": "Perth"
}
}
I used the solution mentioned in
Automatically and Elegantly flatten DataFrame in Spark SQL
however, I'm unable to keep the props field intact; it also gets flattened.
Can somebody help me with extending this solution?
The final schema should be something like :
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
| |-- type2.id: string (nullable = true)
| |-- type2.isMale: boolean (nullable = true)
| |-- type2.mcc: long (nullable = true)
| |-- type2.name: string (nullable = true)
|-- test_id: string (nullable = true)
I've been able to achieve this with the RDD API :
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StructType}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import spark.implicits._

val jsonRDD = df.rdd.map{row =>
def unnest(r: Row): Map[String, Any] = {
r.schema.fields.zipWithIndex.flatMap{case (f, i) =>
(f.name, f.dataType) match {
case ("props", _:StructType) =>
val propsObject = r.getAs[Row](f.name)
Map(f.name -> propsObject.schema.fields.flatMap{propsAttr =>
val subObject = propsObject.getAs[Row](propsAttr.name)
subObject.schema.fields.map{subField =>
s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
}
}.toMap)
case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
case (fname, ArrayType(_: StructType,_)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
case _ => Map(f.name -> r.get(i))
}
}
}.toMap
val asMap = unnest(row)
new ObjectMapper().registerModule(DefaultScalaModule).writeValueAsString(asMap)
}
val finalDF = spark.read.json(jsonRDD.toDS).cache
The solution should accept deeply nested inputs, thanks to recursion.
With your data, here's what we get :
finalDF.printSchema()
finalDF.show(false)
finalDF.select("props.*").show()
Outputs :
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
|-- test_id: string (nullable = true)
+-------+----------------------+-------+
|id |props |test_id|
+-------+----------------------+-------+
|abchchd|[dd, true, 1234, Adam]|ndsbsb |
+-------+----------------------+-------+
+-------+-----------+--------+---------+
|type.id|type.isMale|type.mcc|type.name|
+-------+-----------+--------+---------+
| dd| true| 1234| Adam|
+-------+-----------+--------+---------+
But we can also pass more nested/complex structures, for instance:
val str2 = """{"newroot":[{"mystruct":{"id":"abchchd","test_id":"ndsbsb","props":{"type":{"isMale":true,"id":"dd","mcc":1234,"name":"Adam"}}}}]}"""
...
finalDF.printSchema()
finalDF.show(false)
Gives the following output :
root
|-- newroot: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- mystruct: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- props: struct (nullable = true)
| | | | |-- type.id: string (nullable = true)
| | | | |-- type.isMale: boolean (nullable = true)
| | | | |-- type.mcc: long (nullable = true)
| | | | |-- type.name: string (nullable = true)
| | | |-- test_id: string (nullable = true)
+---------------------------------------------+
|root |
+---------------------------------------------+
|[[[abchchd, [dd, true, 1234, Adam], ndsbsb]]]|
+---------------------------------------------+
EDIT: As you mentioned, if you have records with different structures, you need to wrap the above subObject value in an Option.
Here's the fixed unnest function :
def unnest(r: Row): Map[String, Any] = {
r.schema.fields.zipWithIndex.flatMap{case (f, i) =>
(f.name, f.dataType) match {
case ("props", _:StructType) =>
val propsObject = r.getAs[Row](f.name)
Map(f.name -> propsObject.schema.fields.flatMap{propsAttr =>
val subObjectOpt = Option(propsObject.getAs[Row](propsAttr.name))
subObjectOpt.toSeq.flatMap{subObject => subObject.schema.fields.map{subField =>
s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
}}
}.toMap)
case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
case (fname, ArrayType(_: StructType,_)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
case _ => Map(f.name -> r.get(i))
}
}
}.toMap
New printSchema gives :
root
|-- id: string (nullable = true)
|-- props: struct (nullable = true)
| |-- type.id: string (nullable = true)
| |-- type.isMale: boolean (nullable = true)
| |-- type.mcc: long (nullable = true)
| |-- type.name: string (nullable = true)
| |-- type2.id: string (nullable = true)
| |-- type2.isMale: boolean (nullable = true)
| |-- type2.mcc: long (nullable = true)
| |-- type2.name: string (nullable = true)
|-- test_id: string (nullable = true)
I have a JSON file to be parsed. The JSON format is like this:
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every value in the file. How can I get the "major" values from the array, and do I have to get the "state" using the method df.select("cv_parse.basic_info.location.state")?
This is the result I want:
cv_id major degree birthyear state
001 English Bachelor 1984 New York
001 English Master 1984 New York
This might not be the best way of doing it but you can give it a shot.
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be :
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you can explode the educations column:
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+