I'm trying to set up my Mercurial repository system to work with multiple subrepos. I've basically followed these instructions to set up the client repo with Mercurial client v1.5 and I'm using HgWebDir to host my multiple projects.
I have an HgWebDir with the following structure:
http://myserver/hg
|-- fooproj
|-- mylib
where mylib is a common template library to be consumed by fooproj. The structure of fooproj looks like this:
fooproj
|-- doc/
| `-- readme
|-- src/
| `-- main.cpp
|-- .hgignore
|-- .hgsub
`-- .hgsubstate
And .hgsub looks like:
src/mylib = http://myserver/hg/mylib
This should work, per my interpretation of the documentation:
The first 'nested' is the path in our
working dir, and the second is a URL
or path to pull from.
Also, the mylib project directory structure looks like this:
mylib
|-- .hg
| |-- 00changelog.i
| |-- dirstate
| |-- requires
| |-- store
| | |-- 00changelog.i
| | |-- 00manifest.i
| | |-- data
| | | `-- magic.h.i
| | |-- fncache
| | `-- undo
| |-- undo.branch
| `-- undo.dirstate
`-- magic.h
So, let's say I pull down fooproj to my home folder with:
~$ hg clone http://myserver/hg/fooproj foo
This pulls down the directory structure properly and adds the folder ~/foo/src/mylib, which is a local Mercurial repository. This is where the problems begin: the mylib folder is empty aside from the items in .hg. The messages from Mercurial are:
requesting all changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 5 changes to 5 files
updating working directory
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
Two seconds of investigation shows that src/mylib/.hg/hgrc contains:
[paths]
default = http://myserver/hg/fooproj/src/mylib
which is completely wrong (attempting a pull of that repo will give a 404 because, well, that URL doesn't make any sense).
The cloned foo directory looks like this:
foo
|-- .hg
| |-- 00changelog.i
| |-- branch
| |-- branchheads.cache
| |-- dirstate
| |-- hgrc
| |-- requires
| |-- store
| | |-- 00changelog.i
| | |-- 00manifest.i
| | |-- data
| | | |-- .hgignore.i
| | | |-- .hgsub.i
| | | |-- .hgsubstate.i
| | | |-- doc
| | | | `-- readme.i
| | | `-- src
| | | `-- main.cpp.i
| | |-- fncache
| | `-- undo
| |-- tags.cache
| |-- undo.branch
| `-- undo.dirstate
|-- .hgignore
|-- .hgsub
|-- .hgsubstate
|-- doc
| `-- readme
`-- src
|-- main.cpp
`-- mylib
`-- .hg
|-- 00changelog.i
|-- branch
|-- dirstate
|-- hgrc
|-- requires
`-- store
Logically, the default path should be what I specified in .hgsub, or Mercurial should fetch the files from that repository in some other way. Of course, changing src/mylib/.hg/hgrc to:
[paths]
default = http://myserver/hg/mylib
and running hg pull && hg update works perfectly. Of course, this is basically the same thing as not using subrepos in the first place.
None of the Mercurial commands return error codes (aside from a pull from within src/mylib), so Mercurial clearly believes it is behaving properly (and it just might be), although this does not seem logical at all.
What am I doing wrong?
The ultimate problem might be that .hgsubstate will always look like:
0000000000000000000000000000000000000000 src/mylib
But I have no idea how to fix that...
The left-hand-side path in your .hgsub file is relative to the file's location in your tree. It's already down in src, so src doesn't need to be in the path. I think if you make the .hgsub file look like:
mylib = http://myserver/hg/mylib
and leave it where it is you'll get what you want. Alternately, you could move the location of .hgsub up a directory (outside of src, in your root) and then it would be correct as it is now.
I've just confirmed this interpretation with a setup like this:
.
|-- .hg
| |-- 00changelog.i
| |-- branch
| |-- branchheads.cache
| |-- dirstate
| |-- last-message.txt
| |-- requires
| |-- store
| | |-- 00changelog.i
| | |-- 00manifest.i
| | |-- data
| | | |-- .hgsub.i
| | | `-- .hgsubstate.i
| | |-- fncache
| | `-- undo
| |-- undo.branch
| `-- undo.dirstate
|-- .hgsub
|-- .hgsubstate
`-- src
`-- mylib
|-- .hg
| |-- 00changelog.i
| |-- branch
| |-- branchheads.cache
| |-- dirstate
| |-- hgrc
| |-- last-message.txt
| |-- requires
| |-- store
| | |-- 00changelog.i
| | |-- 00manifest.i
| | |-- data
| | | |-- .hgignore.i
| | | |-- _p_k_g-_i_n_f_o.i
| | | |-- _r_e_a_d_m_e.i
| | | |-- hgext
| | | | `-- chart.py.i
| | | `-- setup.py.i
| | |-- fncache
| | `-- undo
| |-- tags.cache
| |-- undo.branch
| `-- undo.dirstate
|-- .hgignore
|-- PKG-INFO
|-- README
|-- hgext
| `-- chart.py
`-- setup.py
Where that top level .hgsub file contains:
$ cat .hgsub
src/mylib = https://Ry4an@bitbucket.org/Ry4an/hg-chart-extension/
and a clone of the parent shows it cloning the child too:
$ hg clone parent parent-clone
updating to branch default
pulling subrepo src/mylib
requesting all changes
adding changesets
adding manifests
adding file changes
added 8 changesets with 14 changes to 5 files
Related
I have a JSON file whose schema looks like this:
root
|-- errorcode: string (nullable = true)
|-- errormessage: string (nullable = true)
|-- ip: string (nullable = true)
|-- label: string (nullable = true)
|-- status: string (nullable = true)
|-- storageidlist: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- errorcode: string (nullable = true)
| | |-- errormessage: string (nullable = true)
| | |-- fedirectorList: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- directorId: string (nullable = true)
| | | | |-- errorcode: string (nullable = true)
| | | | |-- errordesc: string (nullable = true)
| | | | |-- metrics: string (nullable = true)
| | | | |-- portMetricDataList: array (nullable = true)
| | | | | |-- element: array (containsNull = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- data: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- ts: string (nullable = true)
| | | | | | | | | |-- value: string (nullable = true)
| | | | | | | |-- errorcode: string (nullable = true)
| | | | | | | |-- errordesc: string (nullable = true)
| | | | | | | |-- metricid: string (nullable = true)
| | | | | | | |-- portid: string (nullable = true)
| | | | | | | |-- status: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | |-- metrics: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- storageGroupList: string (nullable = true)
| | |-- storageid: string (nullable = true)
|-- sublabel: string (nullable = true)
|-- ts: string (nullable = true)
I am supposed to extract ip, storageid, directorId, metricid, value, and ts. There is just one item in storageidlist, but 56 items in fedirectorList, and I am unable to parse the JSON beyond storageidlist.
scala> val ip_df = spark.read.option("multiline",true).json("FEDirector_port_data.txt")
ip_df: org.apache.spark.sql.DataFrame = [errorcode: string, errormessage: string ... 6 more fields]
scala> ip_df.select($"storageidlist.storageid").show()
+--------------+
| storageid|
+--------------+
|[000295700670]|
+--------------+
scala> ip_df.select($"storageidlist.fedirectorList.directorId").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`storageidlist`.`fedirectorList`['directorId']' due to data type mismatch: argument 2 requires integral type, however, ''directorId'' is of string type.;;
storageidlist is an array column, so you need to select the first array element and do further selections from that:
ip_df.selectExpr("storageidlist[0].fedirectorList.directorId")
or
ip_df.select($"storageidlist"(0).getField("fedirectorList").getField("directorId"))
It's better to specify an array index whenever you work with array-type columns. Without an index you can go one level deeper and fetch all the corresponding struct fields at the next level, but you can't go any further, as shown in your question.
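As a fuller illustration, here is a minimal sketch that flattens the structure down to the requested columns (ip, storageid, directorId, metricid, value, ts). It is written in PySpark rather than the Scala shell used above, and it assumes the schema printed in the question; each select applies exactly one explode, since Spark allows only one generator per select:

from pyspark.sql.functions import col, explode

flat = (ip_df
    # one row per storageidlist element
    .select("ip", explode("storageidlist").alias("s"))
    # one row per fedirectorList element
    .select("ip", col("s.storageid").alias("storageid"),
            explode("s.fedirectorList").alias("f"))
    # portMetricDataList is an array of arrays, so it is exploded twice
    .select("ip", "storageid", col("f.directorId").alias("directorId"),
            explode("f.portMetricDataList").alias("pm"))
    .select("ip", "storageid", "directorId", explode("pm").alias("p"))
    # finally unwrap the data array that holds the ts/value pairs
    .select("ip", "storageid", "directorId", col("p.metricid").alias("metricid"),
            explode("p.data").alias("d"))
    .select("ip", "storageid", "directorId", "metricid",
            col("d.value").alias("value"), col("d.ts").alias("ts")))

flat.show(5, truncate=False)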
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

file = "<s3path>/<json_file_name.json>"
schema_path = "<s3path>/<json_schema_name.json>"

# read the schema file once and reuse its inferred schema for the data file
json_schema = spark.read.json(schema_path, multiLine=True).schema
df = sqlContext.read.json(file, schema=json_schema, multiLine=True)
#display(df)
df.createOrReplaceTempView("temptable")
#example UDF
def parse_nested_list(nested_list):
    parsed_str = []
    if nested_list:
        for item_list in nested_list:
            if item_list:
                for item in item_list:
                    if item:
                        parsed_str.append(item)
    return "|".join(parsed_str)
def parse_arrs(x):
    if x:
        return "| ".join(
            ", ".join(i for i in e if i is not None) for e in x if e is not None
        )
    else:
        return ""
sqlContext.udf.register("parse_nested_list", parse_nested_list)
sqlContext.udf.register("parse_arrs", parse_arrs)
structured_df = sqlContext.sql("select parse_nested_list(column1.column2) as column3, parse_arrs(column1) as column2 from temptable")
display(structured_df)
To fetch nested arrays, lists, or dictionaries, you can write a UDF that extracts the nested values and register it with PySpark so that it can be used in Spark SQL queries like the one above.
I am new to PySpark. From the code below I want to create a Spark DataFrame, but I am struggling to parse the API response the right way. How can I parse it and get the following output?
Desired output:
+--------------------+--------------------+
|          date_added|               price|
+--------------------+--------------------+
|          2020-11-01|               10000|
+--------------------+--------------------+
The code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects

conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'symbol': 'BTC',
    'convert': 'JPY'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '***********************',
}

session = Session()
session.headers.update(headers)

try:
    response = session.get(url, params=parameters)
    json_rdd = sc.parallelize([response.text])
    #data = json.loads(response.text)
    #print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)

sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()
The output dataframe:
+--------------------+--------------------+
|                data|              status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
+--------------------+--------------------+
JSON schema:
root
|-- data: struct (nullable = true)
| |-- BTC: struct (nullable = true)
| | |-- circulating_supply: long (nullable = true)
| | |-- cmc_rank: long (nullable = true)
| | |-- date_added: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_active: long (nullable = true)
| | |-- is_fiat: long (nullable = true)
| | |-- last_updated: string (nullable = true)
| | |-- max_supply: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- num_market_pairs: long (nullable = true)
| | |-- platform: string (nullable = true)
| | |-- quote: struct (nullable = true)
| | | |-- JPY: struct (nullable = true)
| | | | |-- last_updated: string (nullable = true)
| | | | |-- market_cap: double (nullable = true)
| | | | |-- percent_change_1h: double (nullable = true)
| | | | |-- percent_change_24h: double (nullable = true)
| | | | |-- percent_change_7d: double (nullable = true)
| | | | |-- price: double (nullable = true)
| | | | |-- volume_24h: double (nullable = true)
| | |-- slug: string (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- total_supply: long (nullable = true)
|-- status: struct (nullable = true)
| |-- credit_count: long (nullable = true)
| |-- elapsed: long (nullable = true)
| |-- error_code: long (nullable = true)
| |-- error_message: string (nullable = true)
| |-- notice: string (nullable = true)
| |-- timestamp: string (nullable = true)
It looks like you've parsed it correctly. You can access the nested elements using the dot notation:
from pyspark.sql import functions as F

json_df.select(
    F.col('data.BTC.date_added').alias('date_added'),
    F.col('data.BTC.quote.JPY.price').alias('price')
).show()
How can I turn this list of JSON objects into a Spark dataframe?
[
    {
        '1': 'A',
        '2': 'B'
    },
    {
        '1': 'A',
        '3': 'C'
    }
]
into
1 2 3
A B null
A null C
I've tried spark.read.json(spark.sparkContext.parallelize(d)) and various combinations of that with json.dumps(d).
I had to slay this dragon to import JIRA issues. They came back as a dataset of response objects, each containing an inner array of issue JSON objects.
This code worked as a single transformation to get to the properly-parsed JSON objects in a DataFrame:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode


def issues_enumerated(All_Issues_Paged):

    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a Row holding each response's issues array as a single array record.
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
Schema is too big to show, but here's a snippet:
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- percent: long (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: long (nullable = true)
| |-- aggregatetimeoriginalestimate: long (nullable = true)
| |-- aggregatetimespent: long (nullable = true)
| |-- assignee: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- description: string (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10000: string (nullable = true)
| |-- customfield_10001: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- isShared: boolean (nullable = true)
| | |-- title: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- value: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: struct (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- issueType: struct (nullable = true)
| | | | |-- iconUrl: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- keyNum: long (nullable = true)
| | | |-- projectId: long (nullable = true)
| | | |-- summary: string (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10009: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- boardId: long (nullable = true)
| | | |-- completeDate: string (nullable = true)
| | | |-- endDate: string (nullable = true)
| | | |-- goal: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- startDate: string (nullable = true)
| | | |-- state: string (nullable = true)
...
You can use spark.createDataFrame(d) to get the desired effect.
You do get a deprecation warning about inferring schema from dictionaries, so the "right" way to do this is to first create the rows:
from pyspark.sql import Row

data = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]
schema = ['1', '2', '3']
rows = []
for d in data:
    dict_for_row = {k: d.get(k, None) for k in schema}
    rows.append(Row(**dict_for_row))
then create the DataFrame:
df = spark.createDataFrame(rows)
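Another option, closer to what was originally attempted, is to serialize each dict to a JSON string first and let spark.read.json infer a unified schema. A small sketch, assuming an active spark session:

import json

d = [{'1': 'A', '2': 'B'}, {'1': 'A', '3': 'C'}]

# spark.read.json merges the keys across records, leaving missing ones null
json_df = spark.read.json(spark.sparkContext.parallelize([json.dumps(x) for x in d]))
json_df.show()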
I have test JSON data at the following link:
http://developer.trade.gov/api/market-research-library.json
When I try to read the schema directly from it as follows:
public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}
it gives me:
root
|-- _corrupt_record: string (nullable = true)
If it is malformed, how do I convert it into the format that Spark expects?
Convert your JSON to a single line, or set option("multiLine", true) to allow multi-line JSON.
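For instance, a minimal PySpark sketch of the multiLine route, assuming Spark 2.2+ and that the file has been downloaded locally:

# multiLine lets Spark parse a file containing one JSON document spread
# over many lines, instead of requiring one document per line
df = spark.read.option("multiLine", True).json("market-research-library.json")
df.printSchema()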
If this is the only JSON you want to convert to a dataframe, then I suggest you go with the wholeTextFiles API. Since the JSON is not in a Spark-readable format, it can be converted only when the whole file is read as a single value, and wholeTextFiles does exactly that.
Then you can strip the linefeeds and spaces from the JSON string, and finally you should have the required dataframe.
sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
You should then have the required dataframe with the following schema:
root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
| |-- Report: struct (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- click_url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- country: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- description: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- expiration_date: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- id: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- report_type: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- source_industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- title: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
| |-- /market_research_library/search: struct (nullable = true)
| | |-- get: struct (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- parameters: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- in: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- required: boolean (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | |-- responses: struct (nullable = true)
| | | | |-- 200: struct (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- schema: struct (nullable = true)
| | | | | | |-- items: struct (nullable = true)
| | | | | | | |-- $ref: string (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
| |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)
The format expected by Spark is JSONL (JSON Lines), which is not standard JSON; I got to know this from here. Here's a small Python script to convert your JSON to the expected format:
import jsonlines
import json

with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(json_data)
Then you can access the file in your program as you have written in your code.
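For example, a small sketch reading the converted file back (the path is the one the writer used above):

# card_lines.json now contains one JSON object per line, which spark.read.json expects
df = spark.read.json('C:/Users/ak/Documents/card_lines.json')
df.printSchema()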
I want to read a JSON file in the below format:
{
    "titlename": "periodic",
    "atom": [
        {
            "usage": "neutron",
            "dailydata": [
                {
                    "utcacquisitiontime": "2017-03-27T22:00:00Z",
                    "datatimezone": "+02:00",
                    "intervalvalue": 28128,
                    "intervaltime": 15
                },
                {
                    "utcacquisitiontime": "2017-03-27T22:15:00Z",
                    "datatimezone": "+02:00",
                    "intervalvalue": 25687,
                    "intervaltime": 15
                }
            ]
        }
    ]
}
I am writing my read line as:
sqlContext.read.json("user/files_fold/testing-data.json").printSchema
But I am not getting the desired result:
root
|-- _corrupt_record: string (nullable = true)
Please help me with this.
I suggest using wholeTextFiles to read the file and apply some functions to convert it to a single-line JSON format.
val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").
  map(tuple => tuple._2.replace("\n", "").trim)

val df = sqlContext.read.json(json)
You should then have the final valid dataframe:
+--------------------------------------------------------------------------------------------------------+---------+
|atom |titlename|
+--------------------------------------------------------------------------------------------------------+---------+
|[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
+--------------------------------------------------------------------------------------------------------+---------+
And a valid schema:
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
Spark 2.2 introduced the multiLine option, which can be used to load JSON (not JSONL) files:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/path/to/user.json")
It probably has something to do with the JSON object stored inside your file; could you print it or make sure it's the one you provided in the question? I'm asking because I took that one and it runs just fine:
import org.apache.spark.sql.SparkSession

val json =
  """
    |{
    |  "titlename": "periodic",
    |  "atom": [
    |    {
    |      "usage": "neutron",
    |      "dailydata": [
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:00:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 28128,
    |          "intervaltime": 15
    |        },
    |        {
    |          "utcacquisitiontime": "2017-03-27T22:15:00Z",
    |          "datatimezone": "+02:00",
    |          "intervalvalue": 25687,
    |          "intervaltime": 15
    |        }
    |      ]
    |    }
    |  ]
    |}
  """.stripMargin

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.read
  .json(spark.sparkContext.parallelize(Seq(json)))
  .printSchema()
From the Apache Spark SQL Docs
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object.
Thus,
{ "titlename": "periodic","atom": [{ "usage": "neutron", "dailydata": [ {"utcacquisitiontime": "2017-03-27T22:00:00Z","datatimezone": "+02:00","intervalvalue": 28128,"intervaltime":15},{"utcacquisitiontime": "2017-03-27T22:15:00Z","datatimezone": "+02:00", "intervalvalue": 25687,"intervaltime": 15 }]}]}
And then:
val jsonDF = sqlContext.read.json("file")
jsonDF: org.apache.spark.sql.DataFrame =
[atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>,
titlename: string]
This has already been answered nicely by other contributors, but I had one question: how do I access each nested value/unit of the dataframe?
For collections we can use explode, and for struct types we can reference the field directly with dot (.) notation.
scala> val a = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("file:///home/hdfs/spark_2.json")
a: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string]
scala> a.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
scala> val b = a.withColumn("exploded_atom", explode(col("atom")))
b: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 1 more field]
scala> b.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
scala>
scala> val c = b.withColumn("exploded_atom_struct", explode(col("`exploded_atom`.dailydata")))
c: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 2 more fields]
scala>
scala> c.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
scala> val d = c.withColumn("exploded_atom_struct_last", col("`exploded_atom_struct`.utcacquisitiontime"))
d: org.apache.spark.sql.DataFrame = [atom: array<struct<dailydata:array<struct<datatimezone:string,intervaltime:bigint,intervalvalue:bigint,utcacquisitiontime:string>>,usage:string>>, titlename: string ... 3 more fields]
scala> d.printSchema
root
|-- atom: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dailydata: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- datatimezone: string (nullable = true)
| | | | |-- intervaltime: long (nullable = true)
| | | | |-- intervalvalue: long (nullable = true)
| | | | |-- utcacquisitiontime: string (nullable = true)
| | |-- usage: string (nullable = true)
|-- titlename: string (nullable = true)
|-- exploded_atom: struct (nullable = true)
| |-- dailydata: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- datatimezone: string (nullable = true)
| | | |-- intervaltime: long (nullable = true)
| | | |-- intervalvalue: long (nullable = true)
| | | |-- utcacquisitiontime: string (nullable = true)
| |-- usage: string (nullable = true)
|-- exploded_atom_struct: struct (nullable = true)
| |-- datatimezone: string (nullable = true)
| |-- intervaltime: long (nullable = true)
| |-- intervalvalue: long (nullable = true)
| |-- utcacquisitiontime: string (nullable = true)
|-- exploded_atom_struct_last: string (nullable = true)
scala> val d = c.select(col("titlename"), col("exploded_atom_struct.*"))
d: org.apache.spark.sql.DataFrame = [titlename: string, datatimezone: string ... 3 more fields]
scala> d.show
+---------+------------+------------+-------------+--------------------+
|titlename|datatimezone|intervaltime|intervalvalue| utcacquisitiontime|
+---------+------------+------------+-------------+--------------------+
| periodic| +02:00| 15| 28128|2017-03-27T22:00:00Z|
| periodic| +02:00| 15| 25687|2017-03-27T22:15:00Z|
+---------+------------+------------+-------------+--------------------+
So I thought of posting it here, in case anyone with a similar question sees this one.