PySpark read JSON with custom nested schema doesn't apply

I have this simple JSON file:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}
But when I try to read it like this:
spark.read.option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("///myfile.json") \
    .first() \
    .asDict()
I get:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
Which is wrong, because the fields under adas.lane.keepAssist are not correct.
If I change one of the adas.lane.keepAssist values to "true" in the source JSON, then the mapping is correct...
I also thought that maybe inferSchema was the root of the problem, so I made a custom_schema:
from pyspark.sql.types import StructType, StructField, BooleanType

custom_schema = StructType([
    StructField("adas", StructType([
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ])),
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ]))
    ]))
])
and read it like this:
spark.read.schema(custom_schema) \
    .option("multiline", "true") \
    .json("///myfile.json") \
    .first() \
    .asDict()
And I get the same wrong result:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
The funny thing is that if I change the order in my custom_schema like this:
custom_schema = StructType([
    StructField("adas", StructType([
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ])),
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ]))
    ]))
])
Now every field under adas.parkAssist.rear and adas.parkAssist.front is wrong:
{"adas":{"lane":{"keepAssist":{"right":false,"left":false}}, "parkAssist":{"rear":{"right":false,"left":false},"front":{"right":false,"left":false}}}}
Is this a limitation of PySpark?

It's very strange to me too. I tried first, head, and collect, but they all returned the same distorted structure. If I printed the schema before those calls, it was correct. So the problem is that first, head, and collect do not handle nested structs correctly...
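A minimal way to reproduce the mismatch (a sketch; the path and session setup are illustrative, assuming the question's file is saved as myfile.json):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("multiline", "true").json("myfile.json")  # illustrative path
print(spark.version)                      # the behavior depends on the Spark version
df.printSchema()                          # prints the correct nested schema
print(df.first().asDict(recursive=True))  # recursive=True expands nested Rows into dicts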
Looking for a workaround, I transformed the whole schema (which was correct after reading the JSON file) to a map type.
import pyspark.sql.functions as F

df = spark.read.json(r"path\test_file.json")
df = df.withColumn('adas', F.create_map(
    F.lit('lane'), F.create_map(
        F.lit('keepAssist'), F.create_map(
            F.lit('left'), F.col('adas.lane.keepAssist.left'),
            F.lit('right'), F.col('adas.lane.keepAssist.right')
        )
    ),
    F.lit('parkAssist'), F.create_map(
        F.lit('front'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.front.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.front.muted')
        ),
        F.lit('rear'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.rear.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.rear.muted')
        )
    )
))
print(df.head().asDict())
# {'adas': {'lane': {'keepAssist': {'left': False, 'right': False}}, 'parkAssist': {'rear': {'alarm': False, 'muted': False}, 'front': {'alarm': False, 'muted': False}}}}
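Another possible workaround, a sketch I have not verified against the affected version: serialize the row to a JSON string on the JVM side with to_json, so the nested field names never pass through Python's Row conversion.
import json
import pyspark.sql.functions as F

# to_json(struct(...)) builds the JSON string inside Spark SQL itself;
# json.loads then yields a plain Python dict with the original field names.
raw = df.select(F.to_json(F.struct(F.col("adas"))).alias("js")).first()["js"]
print(json.loads(raw))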

My Spark version was 3.1.1.
After updating it to 3.2.0, the custom nested schema was read as expected.
Thanks!

Related

adding a local html file into a Div in plotly/dash

I created a Dash app. On one tab I have a dropdown and a submit button. In a callback I generate an HTML file, saved locally, based on the quantstats library (reports.html).
I am trying to render this HTML within an html.Div, but with no success so far. All I have managed to do is print out a hyperlink to my local file, not the file itself.
Here is the layout part:
layout = html.Div([
    dbc.Container([
        dbc.Row([
            dbc.Col(html.H1("Reports", className="text-center"), className="mb-5 mt-5")
        ]),
        dbc.Row([
            dbc.Col(dcc.Dropdown(id='reports_type',
                                 options=[
                                     {'label': 'YTD', 'value': 1},
                                     {'label': 'ITD', 'value': 2},
                                 ],
                                 value=1,
                                 style={'display': 'block'},
                                 )),
            dbc.Col(html.Button('Run report', id='reports_submit')),
        ])
    ]),
    html.Div(id='reports_holder')
])
Here is the callback:
@app.callback(
    Output('reports_holder', 'children'),
    [
        Input('reports_submit', 'n_clicks')
    ],
    [
        State('reports_type', 'value')
    ],
    prevent_initial_call=True
)
def produce_report(n_clicks, reports_type):
    if n_clicks:
        df = ...  # placeholder: your dataframe
        if reports_type == 1:  # YTD
            current_df = df[df.index.year == datetime.date.today().year]
            my_pfo = pd.Series(current_df['Close'].values, name='Close', index=current_df.index)
            qs.reports.html(my_pfo, "SPY", output='quantstats-tearsheet.html', title='YTD Tearsheet')
        else:  # ITD
            my_pfo = pd.Series(df['Close'].values, name='Close', index=df.index)
            qs.reports.html(my_pfo, "SPY", output='quantstats-tearsheet.html', title='ITD Tearsheet')
        f = codecs.open('quantstats-tearsheet.html', 'r')
        return f
Here is my latest try, using the codecs library (this raises an error because the file object is not JSON serializable). If I return f.read() instead, the content is not rendered as HTML.
As said before, I tried html.A instead of html.Div in my layout, but that produces only a hyperlink (as expected with html.A). Should I use the target property with html.A? But then how do I specify that the document should be opened in that same html.A? There must be a straightforward solution to this, but I can't find it (most solutions involve opening a local HTML file in a new browser tab...).
Edit: here is an example of the html report generated via quantstats
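One way that should work here, sketched under the assumption of Dash >= 2 (on older versions import from dash_html_components instead): read the generated file and embed its contents in an html.Iframe via its srcDoc prop, which takes the HTML document itself rather than a URL. render_report below is an illustrative helper, not from the original post.
from dash import html

def render_report(path='quantstats-tearsheet.html'):
    # Read the report the callback wrote to disk...
    with open(path, 'r', encoding='utf-8') as f:
        report_html = f.read()
    # ...and render the HTML string inline instead of linking to it.
    return html.Iframe(srcDoc=report_html,
                       style={'width': '100%', 'height': '800px', 'border': 'none'})
Returning render_report() from produce_report, in place of the file object, should then display the tearsheet inside the reports_holder Div.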

Scala - How to Split all List of List Json Nodes using json-path

I have a JSON document from which I want to pick out list-of-list JSON nodes, where the number of instances inside a list can vary. With json-path this is easy when you supply the index of the list/array element, but in a big file we don't know in advance how many instances there will be, and we must not lose any data. So the number of instances has to be checked dynamically, and a separate JSON picked for every element inside the list node. Additionally, a relation_path has to be created for all the data.
Can anyone suggest how to check whether a JSON node is an array/list (e.g. the 2 Drive entries) and how many nested list objects it contains, like 2 Partition objects in the 1st Drive and 1 in the 2nd Drive? These counts are not fixed, so they cannot be hard-coded into the json-path expression.
Input list-of-list JSON:
{"Start":{"HInfo":{"InfoId":"650FEC74","Revision":"5.2.0.51","Drive":[{"InfoId":"650FEC74","Index":0,"Name":"Drive0","Partition":[{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":0},{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":1}]},{"InfoId":"650FEC74","Index":1,"Name":"Drive1","Partition":{"InfoId":"650FEC74","DriveID":"3F275869","Index":0}}]}}}
Output list of JSON:
[{"Partition":[{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":0},{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":1}],"relation_tree":"Start/HInfo/Drive/Drive-1/Partition"},{"Partition":{"InfoId":"650FEC74","DriveID":"3F275869","Index":0},"relation_tree":"Start/HInfo/Drive/Drive-2/Partition"}]
Here is what I am trying with json-path, but it doesn't fit my case, because I am providing the index number manually, which is not possible in general since the index can range from 0 to any number.
val jsonString = """{"Start":{"HInfo":{"InfoId":"650FEC74","Revision":"5.2.0.51","Drive":[{"InfoId":"650FEC74","Index":0,"Name":"Drive0","Partition":[{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":0},{"InfoId":"650FEC74","DriveID":"F91B1F36","Index":1}]},{"InfoId":"650FEC74","Index":1,"Name":"Drive1","Partition":{"InfoId":"650FEC74","DriveID":"3F275869","Index":0}}]}}}"""
val jsonStr: JsValue = Json.parse(jsonString)
var pruneJson1 = (__ \ "Partition").json.copyFrom((__ \ "Start" \ "HInfo" \ "Drive" \ (0) \ "Partition").json.pick)
val finalPartitionPrune1 = Option(jsonStr.transform(pruneJson1)).get.get.as[JsObject] + ("relation_tree" -> Json.toJson("Start"+"/"+"HInfo"+"/"+"Drive"+"/"+"Drive-1"+"/"+"Partition"))
println(finalPartitionPrune1)
var pruneJson2 = (__ \ "Partition").json.copyFrom((__ \ "Start" \ "HInfo" \ "Drive" \ (1) \ "Partition").json.pick)
val finalPartitionPrune2 = Option(jsonStr.transform(pruneJson2)).get.get.as[JsObject] + ("relation_tree" -> Json.toJson("Start"+"/"+"HInfo"+"/"+"Drive"+"/"+"Drive-2"+"/"+"Partition"))
println(finalPartitionPrune2)
This is the simplest solution I could think of:
val finalJson = Json.toJson(
  (jsonStr \ "Start" \ "HInfo" \ "Drive")
    .as[Seq[JsValue]]
    .map(jsValue => JsObject(Seq(
      "Partition" -> (jsValue \ "Partition").get,
      "relation_tree" -> JsString(s"Start/HInfo/Drive/Drive-${(jsValue \ "Index").as[Int] + 1}/Partition")))))
Basically, it reads all drives as a sequence of JsValues and then maps them to JsObjects in the needed format. It uses the drive's Index value (plus one, to match the 1-based Drive-1/Drive-2 names in the expected output) to build the relation_tree value, so it will fail if that value is missing. As an alternative, you can use the zipWithIndex method to add your own indices to the sequence. As a final step it converts the sequence back to a JsValue.
Here's zipWithIndex version:
val finalJson = Json.toJson(
  (jsonStr \ "Start" \ "HInfo" \ "Drive")
    .as[Seq[JsValue]]
    .zipWithIndex
    .map { case (jsValue, index) => JsObject(Seq(
      "Partition" -> (jsValue \ "Partition").get,
      "relation_tree" -> JsString(s"Start/HInfo/Drive/Drive-${index + 1}/Partition")))
    })

Example for Ruby JSON.parse option create_additions?

I'm working on some document enhancements and example code snippets for Ruby's JSON class. I'm puzzled by this option to JSON.parse:
create_additions: If set to false, the Parser doesn't create additions even if a matching class and ::create_id was found. This option defaults to false.
Could someone please provide example code for using this?
Consider this:
require 'json'

class Range
  def to_json(*a)
    {
      'json_class' => self.class.name,
      'data' => [ first, last, exclude_end? ]
    }.to_json(*a)
  end

  def self.json_create(o)
    new(*o['data'])
  end
end
foo = 1 .. 2
Generating JSON:
JSON.generate(foo) # => "{\"json_class\":\"Range\",\"data\":[1,2,false]}"
JSON.generate(foo, { create_additions: false }) # => "{\"json_class\":\"Range\",\"data\":[1,2,false]}"
JSON.generate(foo, { create_additions: true }) # => "{\"json_class\":\"Range\",\"data\":[1,2,false]}"
Parsing the generated JSON:
JSON.parse( JSON.generate(foo) ) # => {"json_class"=>"Range", "data"=>[1, 2, false]}
JSON.parse( JSON.generate(foo), { create_additions: false } ) # => {"json_class"=>"Range", "data"=>[1, 2, false]}
JSON.parse( JSON.generate(foo), { create_additions: true } ) # => 1..2
"2.4.3. JSON.parse and JSON.load" demonstrates a potential bug in JSON that affected create_additions. From there it was a simple thing, just some lines testing the result of toggling the state.
Why they had to close the security hole is for you to research as it involves the specification for JSON serialized data and it being a data-exchange standard, and an example in the JSON docs needs to cover that.
The example is right there in the documentation: https://ruby-doc.org/stdlib-2.6.3/libdoc/json/rdoc/JSON.html#module-JSON-label-Extended+rendering+and+loading+of+Ruby+objects.
The main difference between parse and load in this respect is that the former defaults to not creating additions, while the latter creates them by default.
Extended rendering and loading of Ruby objects
JSON provides optional additions that allow serializing and deserializing Ruby classes without losing their type.
# without additions
require "json"
json = JSON.generate({range: 1..3, regex: /test/})
# => '{"range":"1..3","regex":"(?-mix:test)"}'
JSON.parse(json)
# => {"range"=>"1..3", "regex"=>"(?-mix:test)"}
# with additions
require "json/add/range"
require "json/add/regexp"
json = JSON.generate({range: 1..3, regex: /test/})
# => '{"range":{"json_class":"Range","a":[1,3,false]},"regex":{"json_class":"Regexp","o":0,"s":"test"}}'
JSON.parse(json)
# => {"range"=>{"json_class"=>"Range", "a"=>[1, 3, false]}, "regex"=>{"json_class"=>"Regexp", "o"=>0, "s"=>"test"}}
JSON.load(json)
# => {"range"=>1..3, "regex"=>/test/}
See #load for details.

Mapping a sequence of results from Slick monadic join to Json

I'm using Play 2.4 with Slick 3.1.x, specifically the Slick-Play plugin v1.1.1. Firstly, some context... I have the following search/filter method in a DAO, which joins together 4 models:
def search(
  departureCity: Option[String],
  arrivalCity: Option[String],
  departureDate: Option[Date]
) = {
  val monadicJoin = for {
    sf <- slickScheduledFlights.filter(a =>
      departureDate.map(d => a.date === d).getOrElse(slick.lifted.LiteralColumn(true))
    )
    fl <- slickFlights if sf.flightId === fl.id
    al <- slickAirlines if fl.airlineId === al.id
    da <- slickAirports.filter(a =>
      fl.departureAirportId === a.id &&
      departureCity.map(c => a.cityCode === c).getOrElse(slick.lifted.LiteralColumn(true))
    )
    aa <- slickAirports.filter(a =>
      fl.arrivalAirportId === a.id &&
      arrivalCity.map(c => a.cityCode === c).getOrElse(slick.lifted.LiteralColumn(true))
    )
  } yield (fl, sf, al, da, aa)

  db.run(monadicJoin.result)
}
The output from this is a Vector of tuples, e.g.:
Vector(
(
Flight(Some(1),123,216,2013,3,1455,2540,3,905,500,1150),
ScheduledFlight(Some(1),1,2016-04-13,90,10),
Airline(Some(216),BA,BAW,British Airways,United Kingdom),
Airport(Some(2013),LHR,Heathrow,LON,...),
Airport(Some(2540),JFK,John F Kennedy Intl,NYC...)
),
(
etc ...
)
)
I'm currently rendering the JSON in the controller by calling .toJson on a Map and inserting this Vector (the results param below), like so:
flightService.search(departureCity, arrivalCity, departureDate).map(results => {
  Ok(
    Map[String, Any](
      "status" -> "OK",
      "data" -> results
    ).toJson
  ).as("application/json")
})
While this sort of works, it produces output in an unusual format: within each result object (one per row) the joined models are nested inside objects with the keys "_1", "_2", and so on.
So the question is: How should I go about restructuring this?
There doesn't appear to be anything that specifically covers this scenario in the Slick docs. I would therefore be grateful for input on the best way to restructure this Vector of tuples, with a view to renaming each of the joined models, or even flattening the structure and keeping only certain fields.
Is this best done in the DAO search method before it's returned (by mapping it somehow?) or in the controller after I get back the Future results Vector from the search method?
Or I'm wondering whether it would be preferable to abstract this sort of mutation out somewhere else entirely, using a transformer perhaps?
You need JSON Reads/Writes/Format Combinators.
First, you must have a Writes[T] for each of your classes (Flight, ScheduledFlight, Airline, Airport).
The simplest way is to use the Json macros:
implicit val flightWrites: Writes[Flight] = Json.writes[Flight]
implicit val scheduledFlightWrites: Writes[ScheduledFlight] = Json.writes[ScheduledFlight]
implicit val airlineWrites: Writes[Airline] = Json.writes[Airline]
implicit val airportWrites: Writes[Airport] = Json.writes[Airport]
You must also implement an OWrites[(Flight, ScheduledFlight, Airline, Airport, Airport)] for the Vector items. For example:
val itemWrites: OWrites[(Flight, ScheduledFlight, Airline, Airport, Airport)] = (
  (__ \ "flight").write[Flight] and
  (__ \ "scheduledFlight").write[ScheduledFlight] and
  (__ \ "airline").write[Airline] and
  (__ \ "airport1").write[Airport] and
  (__ \ "airport2").write[Airport]
).tupled
To write the whole Vector as a JsArray, use Writes.seq[T]:
val resultWrites: Writes[Seq[(Flight, ScheduledFlight, Airline, Airport, Airport)]] = Writes.seq(itemWrites)
Now we have everything needed to build the response:
flightService.search(departureCity, arrivalCity, departureDate).map(results =>
  Ok(
    Json.obj(
      "status" -> "Ok",
      "data" -> resultWrites.writes(results)
    )
  )
)

Play Framework: Converting strings to numbers while validating JSON does not work

Given the following JSON...
{
"ask":"428.00",
"bid":"424.20"
}
... I need to convert the values of ask and bid to numbers:
{
"ask": 428.00,
"bid": 424.20
}
As already discussed here, I just need to create a validator like this:
def validate = (
  ((__ \ 'ask).json.update(toNumber)) ~
  ((__ \ 'bid).json.update(toNumber))
).reduce

private def toNumber(implicit reads: Reads[String]) = {
  Reads[JsNumber](js =>
    reads.reads(js).flatMap { value =>
      parse[Double](value) match {
        case Some(number) => JsSuccess(JsNumber(number))
        case _ => JsError(ValidationError("error.number", value))
      }
    }
  )
}
The problem is that only the last node (bid) gets actually converted to a number... and the resulting JSON looks like this:
{
"ask":"428.00",
"bid":424.20
}
Am I missing something?
EDIT
Using andThen only works if the JSON structure contains only strings that need converting to numbers; if the structure already contains numeric fields, it doesn't. Given the following JSON [last is already numeric]:
{
"ask":"428.00",
"bid":"424.20",
"last": 430.05
}
If I modify my validator like this [replaced ~ with andThen and removed reduce]...
def validate = (
  ((__ \ 'ask).json.update(toNumber)) andThen
  ((__ \ 'bid).json.update(toNumber)) andThen
  ((__ \ 'last).json.pickBranch(Reads.of[JsNumber]))
)
... then I get the following error when trying to validate my JSON above:
JsError(List((/bid/last,List(ValidationError(error.path.missing,WrappedArray())))))
Reviewing the docs, it looks like you should be using "andThen", not "~". See "Case 7".