Parse JSON RDD into a DataFrame with PySpark

I am new to PySpark and want to create a Spark DataFrame from the code below, but I am having trouble parsing the response correctly.
How can I parse it into a DataFrame and get the following output?
Desired output:
+--------------------+--------------------+
|          date_added|               price|
+--------------------+--------------------+
|          2020-11-01|               10000|
+--------------------+--------------------+
The code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'symbol': 'BTC',
    'convert': 'JPY'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '***********************',
}
session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    # Wrap the raw JSON text in a one-element RDD so Spark can parse it
    json_rdd = sc.parallelize([response.text])
    #data = json.loads(response.text)
    #print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()
The output dataframe:
| data| status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
JSON schema:
root
|-- data: struct (nullable = true)
| |-- BTC: struct (nullable = true)
| | |-- circulating_supply: long (nullable = true)
| | |-- cmc_rank: long (nullable = true)
| | |-- date_added: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_active: long (nullable = true)
| | |-- is_fiat: long (nullable = true)
| | |-- last_updated: string (nullable = true)
| | |-- max_supply: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- num_market_pairs: long (nullable = true)
| | |-- platform: string (nullable = true)
| | |-- quote: struct (nullable = true)
| | | |-- JPY: struct (nullable = true)
| | | | |-- last_updated: string (nullable = true)
| | | | |-- market_cap: double (nullable = true)
| | | | |-- percent_change_1h: double (nullable = true)
| | | | |-- percent_change_24h: double (nullable = true)
| | | | |-- percent_change_7d: double (nullable = true)
| | | | |-- price: double (nullable = true)
| | | | |-- volume_24h: double (nullable = true)
| | |-- slug: string (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- total_supply: long (nullable = true)
|-- status: struct (nullable = true)
| |-- credit_count: long (nullable = true)
| |-- elapsed: long (nullable = true)
| |-- error_code: long (nullable = true)
| |-- error_message: string (nullable = true)
| |-- notice: string (nullable = true)
| |-- timestamp: string (nullable = true)

It looks like you've parsed it correctly. You can access the nested elements using the dot notation:
from pyspark.sql import functions as F
json_df.select(
    F.col('data.BTC.date_added').alias('date_added'),
    F.col('data.BTC.quote.JPY.price').alias('price')
)
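To see what the dotted path resolves to, here is a minimal plain-Python sketch that walks the same path through a parsed response. It does not use Spark. The get_path helper and the sample payload are hypothetical: the field names come from the schema in the question, and the values are placeholders mirroring the desired output rather than real API data.

```python
import json
from functools import reduce

# Hypothetical sample shaped like the schema in the question; values are placeholders.
sample_text = json.dumps({
    "data": {"BTC": {"date_added": "2020-11-01",
                     "quote": {"JPY": {"price": 10000.0}}}},
    "status": {"error_code": 0},
})

def get_path(obj, dotted_path):
    """Follow a dot-separated path through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), obj)

record = json.loads(sample_text)
row = {
    "date_added": get_path(record, "data.BTC.date_added"),
    "price": get_path(record, "data.BTC.quote.JPY.price"),
}
print(row)
```

Each dot in the path steps one level down into a nested struct, which is the same idea as the column path passed to F.col.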

Related

Update nested struct in spark dataset from another struct column

I have the following spark dataset with a nested struct type:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- customer_servicesegment: string (nullable = true)
| |-- customer_category: string (nullable = true)
My aim is to do some flattening at the bottom of the StructType and end up with this target schema:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
| |-- _3: struct (nullable = true)
| | |-- customer_servicesegment: string (nullable = true)
| | |-- customer_category: string (nullable = true)
The part of the schema with the columns (customer_servicesegment, customer_category) should be at the same level as the one with the columns (brand_name, puk).
Perhaps the explode utility from Spark SQL can be used here, but I don't know where to put it.
Any help with this would be appreciated.
If you have Spark 3.1+, you can use the withField column method to update the struct _1 like this:
val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")
This adds the top-level column _2 as a new field named _3 inside the struct _1, then drops the top-level column _2.
For older versions, you need to reconstruct the struct column _1:
val df2 = df.withColumn(
  "_1",
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")
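As a rough illustration of the reshaping itself (not of the Spark API), the sketch below performs the same move on a plain Python dict. The record contents and the nest_second_field helper are hypothetical; only the field names _1, _2 and _3 and the column names come from the question.

```python
import copy

# Hypothetical record shaped like the source schema (values are made up).
record = {
    "_1": {
        "_1": {"_1": {"brand": "b"}, "_2": {"country": "c"}},
        "_2": {"brand_name": "bn", "puk": "p"},
    },
    "_2": {"customer_servicesegment": "s", "customer_category": "k"},
}

def nest_second_field(rec):
    """Return a copy where top-level '_2' becomes field '_3' of struct '_1'."""
    out = copy.deepcopy(rec)
    out["_1"]["_3"] = out.pop("_2")
    return out

moved = nest_second_field(record)
print(sorted(moved["_1"]))
```

The pop plays the role of drop("_2"), and the assignment plays the role of adding the field _3 to the struct.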

How do I turn a list of JSON objects into a Spark dataframe in Code Workbook?

How can I turn this list of JSON objects into a Spark dataframe?
[
{
'1': 'A',
'2': 'B'
},
{
'1': 'A',
'3': 'C'
}
]
into
1 2 3
A B null
A null C
I've tried spark.read.json(spark.sparkContext.parallelize(d)) and various combinations of that with json.dumps(d).
I had to slay this dragon to import JIRA issues. They came back as a dataset of response objects, each containing an inner array of issue JSON objects.
This code worked as a single transformation to get to the properly-parsed JSON objects in a DataFrame:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode
def issues_enumerated(All_Issues_Paged):
    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a dataframe of each response's issue array as a single array record per-Row
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)
    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
Schema is too big to show, but here's a snippet:
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- percent: long (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: long (nullable = true)
| |-- aggregatetimeoriginalestimate: long (nullable = true)
| |-- aggregatetimespent: long (nullable = true)
| |-- assignee: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- description: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10000: string (nullable = true)
| |-- customfield_10001: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isShared: boolean (nullable = true)
| | |-- title: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: struct (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- issueType: struct (nullable = true)
| | | | |-- iconUrl: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- keyNum: long (nullable = true)
| | | |-- projectId: long (nullable = true)
| | | |-- summary: string (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10009: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- boardId: long (nullable = true)
| | | |-- completeDate: string (nullable = true)
| | | |-- endDate: string (nullable = true)
| | | |-- goal: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- startDate: string (nullable = true)
| | | |-- state: string (nullable = true)
...
You can use spark.createDataFrame(d) to get the desired effect.
You do get a deprecation warning about inferring schema from dictionaries, so the "right" way to do this is to first create the rows:
from pyspark.sql import Row
data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
schema = ['1', '2', '3']
rows = []
for d in data:
    # Fill in None for any key this record is missing
    dict_for_row = {k: d.get(k, None) for k in schema}
    rows.append(Row(**dict_for_row))
then create the DataFrame:
df = spark.createDataFrame(rows)
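The answer above hardcodes the column list. As a small sketch under the assumption that the columns should simply be every key seen in any record, the list can be derived from the data in plain Python. The names columns and normalized are introduced here for illustration only.

```python
data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]

# Column list = sorted union of all keys seen in any record.
columns = sorted({key for record in data for key in record})

# One dict per row, with None standing in for a missing key.
normalized = [{col: record.get(col) for col in columns} for record in data]
print(columns)
print(normalized)
```

Each dict in normalized has the same keys, so it can be turned into a Row in the same way as dict_for_row in the answer.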

Parsing JSON in a Spark column

I have a JSON in a dataframe column that is of type String, and I want to convert that to a map. The catch here is that I don't exactly know the schema of the JSON, since the key name can vary.
Basically, my JSON column looks like:
{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}
I want this to eventually look like:
{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"},
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"},
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}
so that I can calculate mean of all prices when type="DYN".
In other words, reading the JSON data using this:
val testJsonData = spark.read.json("file:///data/json_example")
gives me the following schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
However, I'd like to end up with the much simpler:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
What transformation can I use on the data to be able to end up with the above schema?
Please let me know the easiest way to do this. Thanks in advance!
Your requirement is a bit complex, but if my edit of your question is correct, the following should work.
You already have the input dataframe with this schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
The following step would change the dataframe to have each array in different columns
val tempT = testJsonData.select($"outerkey.*")
whose schema would be
root
|-- innerkey_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
Since you want the key names inside each struct, you first need to get the column names
val schema = tempT.schema.fieldNames
Thus schema would be innerkey_1, innerkey_2, innerkey_3
The complex part is adding the key names inside the struct columns, which requires two for loops over a mutable copy of tempT
import org.apache.spark.sql.functions._
var tt = tempT
for(column <- schema){
  tt = tt.withColumn(column, explode($"${column}"))
}
for(column <- schema){
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}
Finally, tt would have the key names added in the struct columns as
root
|-- innerkey_1: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_2: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_3: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
The final step is to combine all of them into one column, the opposite of what we did in the first step
val temp = tt.select(array(schema.map(col): _*).as("outerkey"))
temp's schema would be your required schema
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
and temp.toJSON.foreach(x => println(x.toString)) should give you your desired json data
{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}
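For comparison, the same reshaping can be sketched on the raw JSON string in plain Python, followed by the mean of the DYN prices that the question was ultimately after. This is only an illustration of the target shape, not a replacement for the Spark steps above; flatten_outerkey is a hypothetical helper name.

```python
import json
from statistics import mean

# The sample document from the question, as a single JSON string.
raw = ('{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],'
       '"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],'
       '"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}')

def flatten_outerkey(doc):
    """Turn {key: [items]} into one list of items tagged with 'keyname'."""
    return {
        "outerkey": [
            {"keyname": name, **item}
            for name, items in doc["outerkey"].items()
            for item in items
        ]
    }

flat = flatten_outerkey(json.loads(raw))
# Mean price over the entries whose type is DYN.
dyn_mean = mean(e["price"] for e in flat["outerkey"] if e["type"] == "DYN")
print(json.dumps(flat))
print(dyn_mean)
```

Because the key names are read from the data rather than from a fixed schema, this works even when the inner keys vary between documents.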

Convert to JSON format expected by Spark for creating schema for dataframe in Java

I have test JSON data at the following link
http://developer.trade.gov/api/market-research-library.json
When I try to read the schema directly from it in the following manner
public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}
it gives me the following result
root
|-- _corrupt_record: string (nullable = true)
If it is malformed, how can I convert it into the format expected by Spark?
Convert your JSON to a single line, or set option("multiLine", true) to allow multi-line JSON.
If this is the only JSON you would like to convert to a dataframe, then I suggest you go with the wholeTextFiles API. Since the JSON is not in a Spark-readable format, you can convert it only once the whole file has been read as a single value, which is what wholeTextFiles does.
Then you can strip the linefeeds and spaces from the JSON string, and you should end up with the required dataframe. Note that replace(" ", "") also removes spaces inside string values.
sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
You should have your required dataframe with following schema
root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
| |-- Report: struct (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- click_url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- country: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- description: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- expiration_date: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- id: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- report_type: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- source_industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- title: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
| |-- /market_research_library/search: struct (nullable = true)
| | |-- get: struct (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- parameters: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- in: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- required: boolean (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | |-- responses: struct (nullable = true)
| | | | |-- 200: struct (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- schema: struct (nullable = true)
| | | | | | |-- items: struct (nullable = true)
| | | | | | | |-- $ref: string (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
| |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)
The format Spark expects by default is JSONL (JSON Lines), which is not standard JSON. Here's a small Python script to convert your JSON to the expected format:
import jsonlines
import json
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    # write_all expects an iterable of records (e.g. a top-level JSON array)
    writer.write_all(json_data)
Then you can access the file in your program as you have written in your code.
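If installing jsonlines is not an option, a similar conversion can be sketched with the standard library alone. The to_json_lines helper and the sample input are hypothetical. Unlike stripping characters with string replacement, re-serializing through the json module leaves spaces inside string values untouched.

```python
import json

def to_json_lines(text):
    """Re-serialize a JSON document as JSON Lines text.

    A top-level array becomes one line per element; any other
    top-level value becomes a single line.
    """
    parsed = json.loads(text)
    records = parsed if isinstance(parsed, list) else [parsed]
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

# Hypothetical multi-line input; note the space inside the string value.
pretty = """{
  "info": {
    "title": "Market Research"
  }
}"""
compact = to_json_lines(pretty)
print(compact)
```

Writing the returned text to a file gives one self-contained JSON object per line.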

How to read the json file in spark using scala?

I want to read a JSON file in the following format:
{
"titlename": "periodic",
"atom": [
{
"usage": "neutron",
"dailydata": [
{
"utcacquisitiontime": "2017-03-27T22:00:00Z",
"datatimezone": "+02:00",
"intervalvalue": 28128,
"intervaltime": 15
},
{
"utcacquisitiontime": "2017-03-27T22:15:00Z",
"datatimezone": "+02:00",
"intervalvalue": 25687,
"intervaltime": 15
}
]
}
]
}
I am writing my read line as:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me on this
I suggest using wholeTextFiles to read the file and apply some functions to convert it to a single-line JSON format.
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").
map(tuple => tuple._2.replace("\n", "").trim)
val df = sqlContext.read.json(json)
You should have the final valid dataframe as
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And valid schema as
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")
It probably has something to do with the JSON object stored inside your file. Could you print it, or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
val json =
"""
|{
| "titlename": "periodic",
| "atom": [
| {
| "usage": "neutron",
| "dailydata": [
| {
| "utcacquisitiontime": "2017-03-27T22:00:00Z",
| "datatimezone": "+02:00",
| "intervalvalue": 28128,
| "intervaltime": 15
| },
| {
| "utcacquisitiontime": "2017-03-27T22:15:00Z",
| "datatimezone": "+02:00",
| "intervalvalue": 25687,
| "intervaltime": 15
| }
| ]
| }
| ]
|}
""".stripMargin
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.read
.json(spark.sparkContext.parallelize(Seq(json)))
.printSchema()
From the Apache Spark SQL Docs
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
This has already been answered nicely by other contributors, but I had one further question: how do I access each nested value of the dataframe?
For collections we can use explode, and for struct types we can access a field directly with dot (.) notation.
scala> val a = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("file:///home/hdfs/spark_2.json")
a: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string]
scala> a.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
scala> val b = a.withColumn("exploded_atom", explode(col("atom")))
b: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 1 more field]
scala> b.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
scala>
scala> val c = b.withColumn("exploded_atom_struct", explode(col("`exploded_atom`.dailydata")))
c: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 2 more fields]
scala>
scala> c.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
scala> val d = c.withColumn("exploded_atom_struct_last", col("`exploded_atom_struct`.utcacquisitiontime"))
d: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 3 more fields]
scala> d.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
|-- exploded_atom_struct_last: string (nullable = true)
scala> val d = c.select(col("titlename"), col("exploded_atom_struct.*"))
d: org.apache.spark.sql.DataFrame = [titlename: string, datatimezone: string ... 3 more fields]
scala> d.show
+---------+------------+------------+-------------+--------------------+
|titlename|datatimezone|intervaltime|intervalvalue| utcacquisitiontime|
+---------+------------+------------+-------------+--------------------+
| periodic| +02:00| 15| 28128|2017-03-27T22:00:00Z|
| periodic| +02:00| 15| 25687|2017-03-27T22:15:00Z|
+---------+------------+------------+-------------+--------------------+
I thought I would post it here in case anyone who finds this question has a similar problem.
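The two explode steps can also be pictured as nested loops. The sketch below is a plain-Python analogy using the JSON document from the question, not Spark code; the rows name is introduced here for illustration.

```python
import json

doc = json.loads("""
{"titlename": "periodic",
 "atom": [{"usage": "neutron",
           "dailydata": [
             {"utcacquisitiontime": "2017-03-27T22:00:00Z", "datatimezone": "+02:00",
              "intervalvalue": 28128, "intervaltime": 15},
             {"utcacquisitiontime": "2017-03-27T22:15:00Z", "datatimezone": "+02:00",
              "intervalvalue": 25687, "intervaltime": 15}]}]}
""")

# Each nested loop plays the role of one explode; the dict merge plays
# the role of selecting titlename plus the struct's fields.
rows = [
    {"titlename": doc["titlename"], **reading}
    for atom in doc["atom"]
    for reading in atom["dailydata"]
]
for row in rows:
    print(row)
```

One output row is produced per element of the innermost array, with the top-level titlename repeated on each row.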