Parse Nested JSON in Spark

I have a JSON file with the following schema:
root
|-- errorcode: string (nullable = true)
|-- errormessage: string (nullable = true)
|-- ip: string (nullable = true)
|-- label: string (nullable = true)
|-- status: string (nullable = true)
|-- storageidlist: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- errorcode: string (nullable = true)
| | |-- errormessage: string (nullable = true)
| | |-- fedirectorList: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- directorId: string (nullable = true)
| | | | |-- errorcode: string (nullable = true)
| | | | |-- errordesc: string (nullable = true)
| | | | |-- metrics: string (nullable = true)
| | | | |-- portMetricDataList: array (nullable = true)
| | | | | |-- element: array (containsNull = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- data: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- ts: string (nullable = true)
| | | | | | | | | |-- value: string (nullable = true)
| | | | | | | |-- errorcode: string (nullable = true)
| | | | | | | |-- errordesc: string (nullable = true)
| | | | | | | |-- metricid: string (nullable = true)
| | | | | | | |-- portid: string (nullable = true)
| | | | | | | |-- status: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | |-- metrics: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- storageGroupList: string (nullable = true)
| | |-- storageid: string (nullable = true)
|-- sublabel: string (nullable = true)
|-- ts: string (nullable = true)
I need to extract ip, storageid, directorId, metricid, value and ts. storageidlist contains just one item, but fedirectorList contains 56. However, I am unable to parse the JSON beyond storageidlist:
scala> val ip_df = spark.read.option("multiline",true).json("FEDirector_port_data.txt")
ip_df: org.apache.spark.sql.DataFrame = [errorcode: string, errormessage: string ... 6 more fields]
scala> ip_df.select($"storageidlist.storageid").show()
+--------------+
| storageid|
+--------------+
|[000295700670]|
+--------------+
scala> ip_df.select($"storageidlist.fedirectorList.directorId").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`storageidlist`.`fedirectorList`['directorId']' due to data type mismatch: argument 2 requires integral type, however, ''directorId'' is of string type.;;

storageidlist is an array column, so you need to select the first array element and do further selections from that:
ip_df.selectExpr("storageidlist[0].fedirectorList.directorId")
or
ip_df.select($"storageidlist"(0).getField("fedirectorList").getField("directorId"))
It's better to specify an array index whenever you work with array-type columns. Without an index you can go one level deeper and fetch all the corresponding struct elements at the next level, but no further, which is the error shown in your question.
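To see which levels have to be unwrapped, here is a minimal pure-Python sketch (no Spark involved) that walks one parsed record shaped like the schema in the question and yields one flat row per data point. The name `flatten_record` and all sample values except the storageid are made up for illustration. In Spark, each `for` loop would roughly correspond to an `explode` on that array column.

```python
from typing import Any, Dict, Iterator

def flatten_record(record: Dict[str, Any]) -> Iterator[tuple]:
    """Yield (ip, storageid, directorId, metricid, value, ts) for every data point."""
    ip = record.get("ip")
    for storage in record.get("storageidlist") or []:
        for director in storage.get("fedirectorList") or []:
            # portMetricDataList is an array of arrays of structs
            for port_metrics in director.get("portMetricDataList") or []:
                for metric in port_metrics or []:
                    for point in metric.get("data") or []:
                        yield (
                            ip,
                            storage.get("storageid"),
                            director.get("directorId"),
                            metric.get("metricid"),
                            point.get("value"),
                            point.get("ts"),
                        )

# Hypothetical sample with the same nesting as the schema above
sample = {
    "ip": "10.0.0.1",
    "storageidlist": [{
        "storageid": "000295700670",
        "fedirectorList": [{
            "directorId": "FA-1D",
            "portMetricDataList": [[{
                "metricid": "IOs",
                "portid": "4",
                "data": [
                    {"ts": "1500000000", "value": "12.5"},
                    {"ts": "1500000300", "value": "13.0"},
                ],
            }]],
        }],
    }],
}

rows = list(flatten_record(sample))
for row in rows:
    print(row)
```

The `or []` guards mirror the `nullable = true` markers in the schema: a missing or null array is simply skipped.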

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
file = "<s3path>/<json_file_name.json>"
schema_path = "<s3path>/<json_schema_name.json>"
# Read a sample file only to infer its schema, then reuse that schema for the real file
json_schema = spark.read.json(schema_path, multiLine=True)
df = sqlContext.read.json(file, schema=json_schema.schema, multiLine=True)
# display(df)  # display() is available in Databricks notebooks
df.createOrReplaceTempView("temptable")

# Example UDF: flatten an array of arrays into a "|"-separated string
def parse_nested_list(nested_list):
    parsed_str = []
    if nested_list:
        for item_list in nested_list:
            if item_list:
                for item in item_list:
                    if item:
                        parsed_str.append(item)
    return "|".join(parsed_str)

# Example UDF: join inner items with ", " and outer items with "| "
def parse_arrs(x):
    if x:
        return "| ".join(
            ", ".join(i for i in e if i is not None) for e in x if e is not None
        )
    else:
        return ""

sqlContext.udf.register("parse_nested_list", parse_nested_list)
sqlContext.udf.register("parse_arrs", parse_arrs)
structured_df = sqlContext.sql("select parse_nested_list(column1.column2) as column3, parse_arrs(column1) as column2 from temptable")
display(structured_df)
To fetch values from nested arrays, lists, or dictionaries, you can write a UDF that extracts the nested values and register it with PySpark so that it can be called from Spark SQL.

Related

Update nested struct in spark dataset from another struct column

I have the following spark dataset with a nested struct type:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- customer_servicesegment: string (nullable = true)
| |-- customer_category: string (nullable = true)
My aim is to flatten the bottom of the struct type to get this target schema:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
| |-- _3: struct (nullable = true)
| | |-- customer_servicesegment: string (nullable = true)
| | |-- customer_category: string (nullable = true)
The part of the schema with the columns (customer_servicesegment, customer_category) should be at the same level as the one with the columns (brand_name, puk).
I think the explode utility from Spark SQL could be used here, but I don't know where to put it.
Any help would be appreciated.

If you have Spark 3.1+, you can use the withField column method to update the struct _1 like this:
val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")
This adds the column _2 as a new field named _3 in the struct _1, then drops the top-level column _2.
For older versions, you need to reconstruct the struct column _1:
val df2 = df.withColumn(
  "_1",
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")

Parse json RDD into dataframe with Pyspark

I am new to Pyspark. From the code below I want to create a spark dataframe. It is difficult to parse it the correct way.
How to parse it in a dataframe the right way?
How can I parse it and get the following output?
Desired output:
date_added| price|
+--------------------+--------------------+
| 2020-11-01| 10000|
The code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects

conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'symbol': 'BTC',
    'convert': 'JPY'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '***********************',
}
session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    json_rdd = sc.parallelize([response.text])
    #data = json.loads(response.text)
    #print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()
The output dataframe:
| data| status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
JSON schema:
root
|-- data: struct (nullable = true)
| |-- BTC: struct (nullable = true)
| | |-- circulating_supply: long (nullable = true)
| | |-- cmc_rank: long (nullable = true)
| | |-- date_added: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_active: long (nullable = true)
| | |-- is_fiat: long (nullable = true)
| | |-- last_updated: string (nullable = true)
| | |-- max_supply: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- num_market_pairs: long (nullable = true)
| | |-- platform: string (nullable = true)
| | |-- quote: struct (nullable = true)
| | | |-- JPY: struct (nullable = true)
| | | | |-- last_updated: string (nullable = true)
| | | | |-- market_cap: double (nullable = true)
| | | | |-- percent_change_1h: double (nullable = true)
| | | | |-- percent_change_24h: double (nullable = true)
| | | | |-- percent_change_7d: double (nullable = true)
| | | | |-- price: double (nullable = true)
| | | | |-- volume_24h: double (nullable = true)
| | |-- slug: string (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- total_supply: long (nullable = true)
|-- status: struct (nullable = true)
| |-- credit_count: long (nullable = true)
| |-- elapsed: long (nullable = true)
| |-- error_code: long (nullable = true)
| |-- error_message: string (nullable = true)
| |-- notice: string (nullable = true)
| |-- timestamp: string (nullable = true)

It looks like you've parsed it correctly. You can access the nested elements using the dot notation:
from pyspark.sql import functions as F

json_df.select(
    F.col('data.BTC.date_added').alias('date_added'),
    F.col('data.BTC.quote.JPY.price').alias('price')
)
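If only a couple of fields are needed, the same paths can also be read from the raw response text before it reaches Spark. This is a minimal sketch, assuming the response has the shape shown in the schema above. `extract_quote` and the sample payload (including its values) are hypothetical.

```python
import json

def extract_quote(response_text, symbol="BTC", currency="JPY"):
    """Return (date_added, price) from a quotes response shaped like the schema above."""
    coin = json.loads(response_text)["data"][symbol]
    return coin["date_added"], coin["quote"][currency]["price"]

# Made-up payload that only contains the fields used here
sample_response = json.dumps({
    "data": {
        "BTC": {
            "date_added": "2013-04-28T00:00:00.000Z",
            "quote": {"JPY": {"price": 1500000.0}},
        }
    },
    "status": {"error_code": 0},
})

date_added, price = extract_quote(sample_response)
print(date_added, price)
```

The key path `data -> BTC -> quote -> JPY -> price` is the same one the dot notation follows in the Spark column expression.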

How do I turn a list of JSON objects into a Spark dataframe in Code Workbook?

How can I turn this list of JSON objects into a Spark dataframe?
[
{
'1': 'A',
'2': 'B'
},
{
'1': 'A',
'3': 'C'
}
]
into
1 2 3
A B null
A null C
I've tried spark.read.json(spark.sparkContext.parallelize(d)) and various combinations of that with json.dumps(d).

I had to slay this dragon to import JIRA issues. They came back as a dataset of response objects, each containing an inner array of issue JSON objects.
This code worked as a single transformation to get to the properly parsed JSON objects in a DataFrame:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode

def issues_enumerated(All_Issues_Paged):
    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a Row holding one response's issue array as a list of JSON strings
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
Schema is too big to show, but here's a snippet:
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- percent: long (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: long (nullable = true)
| |-- aggregatetimeoriginalestimate: long (nullable = true)
| |-- aggregatetimespent: long (nullable = true)
| |-- assignee: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- description: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10000: string (nullable = true)
| |-- customfield_10001: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isShared: boolean (nullable = true)
| | |-- title: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: struct (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- issueType: struct (nullable = true)
| | | | |-- iconUrl: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- keyNum: long (nullable = true)
| | | |-- projectId: long (nullable = true)
| | | |-- summary: string (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10009: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- boardId: long (nullable = true)
| | | |-- completeDate: string (nullable = true)
| | | |-- endDate: string (nullable = true)
| | | |-- goal: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- startDate: string (nullable = true)
| | | |-- state: string (nullable = true)
...

You can use spark.createDataFrame(d) to get the desired effect.
You do get a deprecation warning about inferring schema from dictionaries, so the "right" way to do this is to first create the rows:
from pyspark.sql import Row
data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
schema = ['1', '2', '3']
rows = []
for d in data:
    dict_for_row = {k: d.get(k, None) for k in schema}
    rows.append(Row(**dict_for_row))
then create the DataFrame:
df = spark.createDataFrame(rows)
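Rather than hard-coding the column list, it can be derived from the data itself. This is a small sketch in plain Python. `normalize_records` is a hypothetical helper name, and the normalized dicts could then be turned into `Row` objects as above.

```python
def normalize_records(records):
    """Give every dict the same keys, filling missing ones with None."""
    # Union of all keys, sorted so the column order is deterministic
    columns = sorted({key for record in records for key in record})
    normalized = [{col: record.get(col) for col in columns} for record in records]
    return columns, normalized

data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
columns, normalized = normalize_records(data)
print(columns)
print(normalized)
```

Sorting the keys is a design choice made here for reproducibility. Any stable ordering would do.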

Convert to JSON format expected by Spark for creating schema for dataframe in Java

I have test JSON data at following link
http://developer.trade.gov/api/market-research-library.json
When I try to read the schema directly from it in the following manner
public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}
it gives me the following output:
root
|-- _corrupt_record: string (nullable = true)
If the file is malformed, how can I convert it into the format expected by Spark?

Convert your JSON to a single line.
Or set option("multiLine", true) to allow multi-line JSON.
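One way to do the single-line conversion outside Spark is to parse the document and re-serialize it with Python's standard `json` module, which leaves whitespace inside string values untouched. This is a sketch only: `to_single_line` is an illustrative name and the sample document is made up.

```python
import json

def to_single_line(pretty_json_text):
    """Re-serialize a JSON document onto one line without touching string contents."""
    return json.dumps(json.loads(pretty_json_text), separators=(",", ":"))

# Made-up pretty-printed document
pretty = """{
  "info": {
    "title": "Market Research Library"
  }
}"""

single = to_single_line(pretty)
print(single)
```

Writing `single` to a file gives one JSON object on one line, which is what the default JSON reader expects.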

If this is the only JSON file you need to convert to a DataFrame, I suggest using the wholeTextFiles API. Because the file is not in a format that Spark can read line by line, it has to be read as a single value, which is what wholeTextFiles does.
You can then strip the line feeds and spaces from the JSON string and pass the result to the JSON reader. Note that replace(" ", "") also removes spaces inside string values.
sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
You should have your required dataframe with following schema
root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
| |-- Report: struct (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- click_url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- country: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- description: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- expiration_date: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- id: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- report_type: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- source_industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- title: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
| |-- /market_research_library/search: struct (nullable = true)
| | |-- get: struct (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- parameters: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- in: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- required: boolean (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | |-- responses: struct (nullable = true)
| | | | |-- 200: struct (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- schema: struct (nullable = true)
| | | | | | |-- items: struct (nullable = true)
| | | | | | | |-- $ref: string (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
| |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)

The format Spark expects by default is JSON Lines (JSONL), which is not standard JSON. Here's a small Python script to convert your JSON to the expected format:
import jsonlines
import json

with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(json_data)
Then you can access the file in your program as you have written in your code.
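The same conversion can be sketched with only the standard library, assuming the source file holds a top-level JSON array. `json_array_to_lines` is an illustrative name and the records are made up.

```python
import json

def json_array_to_lines(records):
    """Serialize each element of a list as one JSON document per line."""
    return "\n".join(json.dumps(record) for record in records)

# Made-up records standing in for the parsed top-level array
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
jsonl_text = json_array_to_lines(records)
print(jsonl_text)
```

`json.dumps` never emits raw newlines with its default settings, so each record is guaranteed to stay on its own line.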

How to read the json file in spark using scala?

I want to read a JSON file in the following format:
{
"titlename": "periodic",
"atom": [
{
"usage": "neutron",
"dailydata": [
{
"utcacquisitiontime": "2017-03-27T22:00:00Z",
"datatimezone": "+02:00",
"intervalvalue": 28128,
"intervaltime": 15
},
{
"utcacquisitiontime": "2017-03-27T22:15:00Z",
"datatimezone": "+02:00",
"intervalvalue": 25687,
"intervaltime": 15
}
]
}
]
}
I am reading it like this:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me with this.

I suggest using wholeTextFiles to read the file and apply some functions to convert it to a single-line JSON format.
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json")
  .map(tuple => tuple._2.replace("\n", "").trim)
val df = sqlContext.read.json(json)
You should have the final valid dataframe as
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And valid schema as
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)

Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

It probably has something to do with the JSON object stored inside your file, could you print it or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
val json =
"""
|{
| "titlename": "periodic",
| "atom": [
| {
| "usage": "neutron",
| "dailydata": [
| {
| "utcacquisitiontime": "2017-03-27T22:00:00Z",
| "datatimezone": "+02:00",
| "intervalvalue": 28128,
| "intervaltime": 15
| },
| {
| "utcacquisitiontime": "2017-03-27T22:15:00Z",
| "datatimezone": "+02:00",
| "intervalvalue": 25687,
| "intervaltime": 15
| }
| ]
| }
| ]
|}
""".stripMargin
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.read
.json(spark.sparkContext.parallelize(Seq(json)))
.printSchema()

From the Apache Spark SQL Docs
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
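A quick way to check whether a file meets that one-document-per-line requirement is to try parsing every line on its own. This is a rough sketch with the standard library. `invalid_json_lines` is a hypothetical helper, and it only checks that each line is valid JSON, not that it is a JSON object.

```python
import json

def invalid_json_lines(text):
    """Return the 1-based numbers of non-empty lines that are not valid JSON on their own."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad

multi_line = '{\n  "titlename": "periodic"\n}'
single_line = '{"titlename": "periodic"}'

print(invalid_json_lines(multi_line))
print(invalid_json_lines(single_line))
```

A pretty-printed file fails on practically every line, which is why the default reader ends up with a `_corrupt_record` column.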

This has already been answered nicely by other contributors, but I had one more question: how do I access each nested value of the DataFrame?
For collections we can use explode, and for struct types we can access a field directly with dot (.) notation.
scala> val a = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("file:///home/hdfs/spark_2.json")
a: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string]
scala> a.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
scala> val b = a.withColumn("exploded_atom", explode(col("atom")))
b: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 1 more field]
scala> b.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
scala>
scala> val c = b.withColumn("exploded_atom_struct", explode(col("`exploded_atom`.dailydata")))
c: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 2 more fields]
scala>
scala> c.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
scala> val d = c.withColumn("exploded_atom_struct_last", col("`exploded_atom_struct`.utcacquisitiontime"))
d: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 3 more fields]
scala> d.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
|-- exploded_atom_struct_last: string (nullable = true)
scala> val d = c.select(col("titlename"), col("exploded_atom_struct.*"))
d: org.apache.spark.sql.DataFrame = [titlename: string, datatimezone: string ... 3 more fields]
scala> d.show
+---------+------------+------------+-------------+--------------------+
|titlename|datatimezone|intervaltime|intervalvalue| utcacquisitiontime|
+---------+------------+------------+-------------+--------------------+
| periodic| +02:00| 15| 28128|2017-03-27T22:00:00Z|
| periodic| +02:00| 15| 25687|2017-03-27T22:15:00Z|
+---------+------------+------------+-------------+--------------------+
I thought I would post it here in case anyone with a similar question finds this one.
