Transform ARRAY STRUCT JSON to CSV

What is the solution for the following: I have a Parquet table that contains arrays and structs.
root
|-- Loaded: long (nullable = true)
|-- article: string (nullable = true)
|-- brand: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- uRL: string (nullable = true)
|-- breadcrumbs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- uRL: string (nullable = true)
|-- characteristics: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- name: string (nullable = true)
| | | |-- value: string (nullable = true)
|-- competitorKey: string (nullable = true)
|-- description: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- descriptionFull: string (nullable = true)
|-- groupId: string (nullable = true)
|-- groupedProducts: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: string (nullable = true)
|-- keywords: array (nullable = true)
| |-- element: string (containsNull = true)
|-- labels: array (nullable = true)
| |-- element: string (containsNull = true)
|-- merchant: struct (nullable = true)
| |-- created: long (nullable = true)
| |-- extraId: string (nullable = true)
| |-- id: string (nullable = true)
| |-- isNaturalPerson: boolean (nullable = true)
| |-- legalAddress: string (nullable = true)
| |-- legalINN: string (nullable = true)
| |-- legalName: string (nullable = true)
| |-- legalOGRN: string (nullable = true)
| |-- name: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- rating: double (nullable = true)
| |-- ratingCount: long (nullable = true)
| |-- uRL: string (nullable = true)
|-- merchants: array (nullable = true)
| |-- element: string (containsNull = true)
|-- name: string (nullable = true)
|-- questionsCount: long (nullable = true)
|-- rating: double (nullable = true)
|-- ratingCount: long (nullable = true)
|-- region: string (nullable = true)
|-- reviewsCount: long (nullable = true)
|-- sold2M: long (nullable = true)
|-- taskId: string (nullable = true)
|-- uRL: string (nullable = true)
|-- variations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- article: string (nullable = true)
| | |-- barcode: string (nullable = true)
| | |-- color: string (nullable = true)
| | |-- fbs: boolean (nullable = true)
| | |-- id: string (nullable = true)
| | |-- images: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- inventory: long (nullable = true)
| | |-- isExpress: boolean (nullable = true)
| | |-- isbn: string (nullable = true)
| | |-- msrp: double (nullable = true)
| | |-- multiProduct: boolean (nullable = true)
| | |-- offers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- otherSellers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- promo: double (nullable = true)
| | |-- properties: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- value: string (nullable = true)
| | |-- size: string (nullable = true)
| | |-- skuId: string (nullable = true)
| | |-- sold: long (nullable = true)
| | |-- uRL: string (nullable = true)
| | |-- videos: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- viewsCount: long (nullable = true)
I need to prepare a CSV file from this.
The required fields are "region, id, article, name, url, variations(msrp, price, brand, breadcrumbs), merchant(id, name, legalName, legalINN, legalOGRN, phone), rating, ratingCount, reviewsCount, questionsCount, viewsCount, sold2M".
I know how to df.select("region"), for example, but I do not know how to read msrp and price from the "variations" array, all fields from the "brand" struct, and the other cases where columns are nested inside an array/struct.
How can I transform this data? Thank you!
I have tried
df = df.select('region','id', 'article', 'name', 'url', F.col('variations').getItem('element').getItem('msrp').cast('string'))
but it did not work
Error: "No such struct field element in id, skuId, uRL, article, isbn, barcode, msrp, price, promo, multiProduct, inventory, whInventories, images, videos, size, color, sold, otherSellers, properties, offers, isExpress, fbs"

Related

Update nested struct in spark dataset from another struct column

I have the following spark dataset with a nested struct type:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- customer_servicesegment: string (nullable = true)
| |-- customer_category: string (nullable = true)
my aim here is to do some flattening at the bottom of the structype and have this target schema:
-- _1: struct (nullable = false)
| |-- _1: struct (nullable = false)
| | |-- _1: struct (nullable = false)
| | | |-- ni_number: string (nullable = true)
| | | |-- national_registration_number: string (nullable = true)
| | | |-- id_issuing_country: string (nullable = true)
| | | |-- doc_type_name: string (nullable = true)
| | | |-- brand: string (nullable = true)
| | | |-- company_name: string (nullable = true)
| | |-- _2: struct (nullable = true)
| | | |-- municipality: string (nullable = true)
| | | |-- country: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- brand_name: string (nullable = true)
| | |-- puk: string (nullable = true)
| |-- _3: struct (nullable = true)
| | |-- customer_servicesegment: string (nullable = true)
| | |-- customer_category: string (nullable = true)
the part of the schema with the columns (customer_servicesegment, customer_category) should be at the same level as the one with the cols (brand_name, puk)
So here explode utility from spark sql can be used but I don't know where to put it
any help with this please
If you have Spark 3.1+, you can use the withField column method to update the struct _1 like this:
val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")
This adds the column _2 as new field named _3 into the struct _1 then drops the column _2 for first level.
For older versions, you need to reconstruct the struct column _1:
val df2 = df.withColumn(
  "_1",
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")
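Since the main question above is in PySpark: Column.withField is also available in PySpark 3.1+, so an equivalent sketch in Python would be:
from pyspark.sql import functions as F

# assumes Spark 3.1+; mirrors the Scala withField answer above
df2 = df.withColumn("_1", F.col("_1").withField("_3", F.col("_2"))).drop("_2")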

Parse json RDD into dataframe with Pyspark

I am new to PySpark. From the code below I want to create a Spark dataframe, but it is difficult to parse the response the correct way.
How can I parse it and get the following output?
Desired output:
+----------+-----+
|date_added|price|
+----------+-----+
|2020-11-01|10000|
+----------+-----+
The code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects

conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'symbol': 'BTC',
    'convert': 'JPY'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '***********************',
}

session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    json_rdd = sc.parallelize([response.text])
    # data = json.loads(response.text)
    # print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)

sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()
The output dataframe:
+--------------------+--------------------+
|                data|              status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
+--------------------+--------------------+
JSON schema:
root
|-- data: struct (nullable = true)
| |-- BTC: struct (nullable = true)
| | |-- circulating_supply: long (nullable = true)
| | |-- cmc_rank: long (nullable = true)
| | |-- date_added: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_active: long (nullable = true)
| | |-- is_fiat: long (nullable = true)
| | |-- last_updated: string (nullable = true)
| | |-- max_supply: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- num_market_pairs: long (nullable = true)
| | |-- platform: string (nullable = true)
| | |-- quote: struct (nullable = true)
| | | |-- JPY: struct (nullable = true)
| | | | |-- last_updated: string (nullable = true)
| | | | |-- market_cap: double (nullable = true)
| | | | |-- percent_change_1h: double (nullable = true)
| | | | |-- percent_change_24h: double (nullable = true)
| | | | |-- percent_change_7d: double (nullable = true)
| | | | |-- price: double (nullable = true)
| | | | |-- volume_24h: double (nullable = true)
| | |-- slug: string (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- total_supply: long (nullable = true)
|-- status: struct (nullable = true)
| |-- credit_count: long (nullable = true)
| |-- elapsed: long (nullable = true)
| |-- error_code: long (nullable = true)
| |-- error_message: string (nullable = true)
| |-- notice: string (nullable = true)
| |-- timestamp: string (nullable = true)
It looks like you've parsed it correctly. You can access the nested elements using the dot notation:
from pyspark.sql import functions as F

json_df.select(
    F.col('data.BTC.date_added').alias('date_added'),
    F.col('data.BTC.quote.JPY.price').alias('price')
)

How do I turn a list of JSON objects into a Spark dataframe in Code Workbook?

How can I turn this list of JSON objects into a Spark dataframe?
[
    {
        '1': 'A',
        '2': 'B'
    },
    {
        '1': 'A',
        '3': 'C'
    }
]
into
1 2 3
A B null
A null C
I've tried spark.read.json(spark.sparkContext.parallelize(d)) and various combinations of that with json.dumps(d).
I had to slay this dragon to import JIRA issues. They came back as a dataset of response objects, each containing an inner array of issue JSON objects.
This code worked as a single transformation to get to the properly-parsed JSON objects in a DataFrame:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode

def issues_enumerated(All_Issues_Paged):
    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a dataframe of each response's issue array as a single array record per-Row
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
Schema is too big to show, but here's a snippet:
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- percent: long (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: long (nullable = true)
| |-- aggregatetimeoriginalestimate: long (nullable = true)
| |-- aggregatetimespent: long (nullable = true)
| |-- assignee: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- description: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10000: string (nullable = true)
| |-- customfield_10001: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isShared: boolean (nullable = true)
| | |-- title: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: struct (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- issueType: struct (nullable = true)
| | | | |-- iconUrl: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- keyNum: long (nullable = true)
| | | |-- projectId: long (nullable = true)
| | | |-- summary: string (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10009: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- boardId: long (nullable = true)
| | | |-- completeDate: string (nullable = true)
| | | |-- endDate: string (nullable = true)
| | | |-- goal: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- startDate: string (nullable = true)
| | | |-- state: string (nullable = true)
...
You can use spark.createDataFrame(d) to get the desired effect.
You do get a deprecation warning about inferring schema from dictionaries, so the "right" way to do this is to first create the rows:
from pyspark.sql import Row
data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
schema = ['1', '2', '3']
rows = []
for d in data:
    dict_for_row = {k: d.get(k, None) for k in schema}
    rows.append(Row(**dict_for_row))
then create the DataFrame:
df = spark.createDataFrame(rows)
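Alternatively, the spark.read.json route from the question also works if each dict is first serialized to a JSON string; a minimal sketch (output formatting approximate):
import json

d = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
json_rdd = spark.sparkContext.parallelize([json.dumps(x) for x in d])
df = spark.read.json(json_rdd)
df.show()
# +---+----+----+
# |  1|   2|   3|
# +---+----+----+
# |  A|   B|null|
# |  A|null|   C|
# +---+----+----+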

Parsing JSON in a Spark column

I have a JSON in a dataframe column that is of type String, and I want to convert that to a map. The catch here is that I don't exactly know the schema of the JSON, since the key name can vary.
Basically, my JSON column looks like:
{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}
I want this to eventually look like:
{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"},
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"},
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}
so that I can calculate mean of all prices when type="DYN".
In other words, reading the JSON data using this:
val testJsonData = spark.read.json("file:///data/json_example")
gives me the following schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
However, I'd like to end up with the much simpler:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
What transformation can I use on the data to be able to end up with the above schema?
Please let me know the easiest way to do this. Thanks in advance!
Your requirement is a bit complex, and if my edit of your question is correct, then the following can be your solution.
You already have an input dataframe with this schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
The following step changes the dataframe so that each array ends up in a separate column:
val tempT = testJsonData.select($"outerkey.*")
whose schema would be
root
|-- innerkey_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
Since you want the keynames in each struct, you need to get the field names:
val schema = tempT.schema.fieldNames
Thus schema would be innerkey_1, innerkey_2, innerkey_3
The complex part is adding the keynames inside the struct columns, which requires two for loops (starting from tempT as a mutable var so it can be reassigned):
import org.apache.spark.sql.functions._

var tt = tempT
for (column <- schema) {
  tt = tt.withColumn(column, explode($"${column}"))
}
for (column <- schema) {
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}
Finally, tt has the keynames added inside the struct columns:
root
|-- innerkey_1: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_2: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_3: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
The final step is to combine all of them into one column, the opposite of what we did in the first step:
val temp = tt.select(array(schema.map(col): _*).as("outerkey"))
temp's schema would be your required schema
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
and temp.toJSON.foreach(x => println(x.toString)) should give you your desired json data
{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}

How to read the json file in spark using scala?

I want to read a JSON file in the following format:
{
  "titlename": "periodic",
  "atom": [
    {
      "usage": "neutron",
      "dailydata": [
        {
          "utcacquisitiontime": "2017-03-27T22:00:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 28128,
          "intervaltime": 15
        },
        {
          "utcacquisitiontime": "2017-03-27T22:15:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 25687,
          "intervaltime": 15
        }
      ]
    }
  ]
}
I am writing my read line as:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me on this
I suggest using wholeTextFiles to read the file and apply some functions to convert it to a single-line JSON format.
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").
  map(tuple => tuple._2.replace("\n", "").trim)
val df = sqlContext.read.json(json)
You should have the final valid dataframe as
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And valid schema as
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")
It probably has something to do with the JSON object stored inside your file; could you print it, or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
val json =
  """
    |{
    |  "titlename": "periodic",
    |  "atom": [
    |    {
    |      "usage": "neutron",
    |      "dailydata": [
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:00:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 28128,
    |          "intervaltime": 15
    |        },
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:15:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 25687,
    |          "intervaltime": 15
    |        }
    |      ]
    |    }
    |  ]
    |}
  """.stripMargin
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.read
  .json(spark.sparkContext.parallelize(Seq(json)))
  .printSchema()
From the Apache Spark SQL Docs
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
This has already been answered nicely by other contributors, but I had one question: how do I access each nested value/unit of the dataframe?
So, for collections we can use explode, and for struct types we can reference the field directly with dot (.) notation.
scala> val a = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("file:///home/hdfs/spark_2.json")
a: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string]
scala> a.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
scala> val b = a.withColumn("exploded_atom", explode(col("atom")))
b: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 1 more field]
scala> b.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
scala>
scala> val c = b.withColumn("exploded_atom_struct", explode(col("`exploded_atom`.dailydata")))
c: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 2 more fields]
scala>
scala> c.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
scala> val d = c.withColumn("exploded_atom_struct_last", col("`exploded_atom_struct`.utcacquisitiontime"))
d: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 3 more fields]
scala> d.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
|-- exploded_atom_struct_last: string (nullable = true)
scala> val d = c.select(col("titlename"), col("exploded_atom_struct.*"))
d: org.apache.spark.sql.DataFrame = [titlename: string, datatimezone: string ... 3 more fields]
scala> d.show
+---------+------------+------------+-------------+--------------------+
|titlename|datatimezone|intervaltime|intervalvalue| utcacquisitiontime|
+---------+------------+------------+-------------+--------------------+
| periodic| +02:00| 15| 28128|2017-03-27T22:00:00Z|
| periodic| +02:00| 15| 25687|2017-03-27T22:15:00Z|
+---------+------------+------------+-------------+--------------------+
So I thought of posting it here, in case anyone seeing this question has a similar one.