Aggregate JSON files in Spark RDD

I have a series of files that look similar to this:
[
  {
    "id": 1,
    "transactions": [
      {
        "date": "2019-01-01",
        "amount": 50.50
      },
      {
        "date": "2019-01-02",
        "amount": 10.20
      }
    ]
  },
  {
    "id": 2,
    "transactions": [
      {
        "date": "2019-01-01",
        "amount": 10.20
      },
      {
        "date": "2019-01-02",
        "amount": 0.50
      }
    ]
  }
]
I load these files into Spark using the following code:
users = spark.read.option("multiline", "true").json(file_location)
The result is a Spark DataFrame with two columns, id and transactions, where transactions is an array of structs.
I want to be able to "map" the transactions per user to aggregate them.
Currently I am using rdd and a function that looks like this:
users.rdd.map(lambda a: summarize_transactions(a.transactions))
The summarize function can take one of two forms (a sketch of variant (a) follows below):
a) Turn the list of objects into a pandas DataFrame to summarize it.
b) Iterate over the list of objects to summarize it.
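For reference, a minimal sketch of variant (a); the function body and the returned fields are illustrative, not from the original post:
import pandas as pd

def summarize_transactions(transactions):
    # Variant (a): load the list of dicts into a pandas DataFrame
    # and aggregate the amounts.
    df = pd.DataFrame(transactions)
    return {"count": len(df), "total_amount": float(df["amount"].sum())}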
However, I found that a.transactions is a list of pyspark.sql.types.Row objects instead of actual dictionaries.
1) Is this the best way to accomplish my goal?
2) How can I turn the list of Spark Rows into the original list of dictionaries?

I found a way to solve my own problem:
STEP 1: LOAD DATA AS TEXT:
# Each file holds one multi-line JSON array, so read files whole
# (sc.textFile would split them line by line and break json.loads):
step1 = sc.wholeTextFiles(file_location).values()
STEP 2: READ AS JSON AND FLATMAP
import json
step2 = step1.map(lambda a: json.loads(a)).flatMap(lambda a: a)
STEP 3: KEY MAP REDUCE
step3 = (
    step2
    .map(lambda line: (line['id'], line['transactions']))  # (key, value) pairs
    .reduceByKey(lambda a, b: a + b)
    .mapValues(lambda a: summarize_transactions(a))
)
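As for question 2, pyspark.sql.types.Row has an asDict() method, so the original DataFrame-based approach can also work; a minimal sketch:
users.rdd.map(
    lambda a: summarize_transactions([t.asDict() for t in a.transactions])
)
Each Row is converted back to a plain dict before summarize_transactions sees it.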

Related

Table from nested list, struct

I have this json data:
consumption_json = """
{
    "count": 48,
    "next": null,
    "previous": null,
    "results": [
        {
            "consumption": 0.063,
            "interval_start": "2018-05-19T00:30:00+0100",
            "interval_end": "2018-05-19T01:00:00+0100"
        },
        {
            "consumption": 0.071,
            "interval_start": "2018-05-19T00:00:00+0100",
            "interval_end": "2018-05-19T00:30:00+0100"
        },
        {
            "consumption": 0.073,
            "interval_start": "2018-05-18T23:30:00+0100",
            "interval_end": "2018-05-18T00:00:00+0100"
        }
    ]
}
"""
and I would like to convert the results list to an Arrow table.
I have managed this by first converting it to a Python data structure, using Python's json library, and then converting that to an Arrow table.
import json
import pyarrow as pa

consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather just use pyarrow exclusively for this.
I can use pyarrow's json reader to make a table.
import pyarrow.json  # read_json lives in a submodule that must be imported explicitly

reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
Following this answer (https://stackoverflow.com/a/72880717/3617057), I can get closer:
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-17 23:00:00
]
]
flat is a ChunkedArray whose underlying arrays are StructArrays. To convert it to a table, you need to convert each chunk to a RecordBatch and concatenate them into a table:
pa.Table.from_batches(
    [
        pa.RecordBatch.from_struct_array(s)
        for s in flat.iterchunks()
    ]
)
If flat is just a StructArray (not a ChunkedArray), you can call:
pa.Table.from_batches(
    [
        pa.RecordBatch.from_struct_array(flat)
    ]
)
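Putting the pieces together, a minimal end-to-end sketch using only the calls shown above:
import pyarrow as pa
import pyarrow.json
import pyarrow.compute as pc

# Read the JSON buffer, flatten the nested list<struct> column,
# and rebuild a flat table without leaving pyarrow.
reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
nested = pa.json.read_json(reader)
flat = pc.list_flatten(nested['results'])
table = pa.Table.from_batches(
    [pa.RecordBatch.from_struct_array(chunk) for chunk in flat.iterchunks()]
)
print(table.column_names)  # ['consumption', 'interval_start', 'interval_end']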

FastAPI not returning list of SQLAlchemy rows properly

I am trying to return a list of SQLAlchemy rows and output it from a FastAPI endpoint. Each row in the list consists of an actor's name and the actor's total number of lines from a show. I believe the query itself is correct, but when viewing the output from the endpoint, one of the columns is missing for some reason.
The query function:
def get_actors(db: Session, detailed: bool = False) -> list[str]:
    """Return a list of actors and their total lines from the show"""
    if not detailed:
        query = (
            db.query(models.Script.actor, func.count(models.Script.detail))  # (<actor name>, <total lines>)
            .filter(models.Script.actor.isnot(None))  # Skip null actors.
            .group_by(models.Script.actor)  # Group unique actors.
            .order_by(func.count(models.Script.detail).desc())  # Sort by most to least lines.
        )
        actors_list = [actor for actor in query]
        print("ACTORS LIST:", actors_list)  # Debug print, looks fine.
        return actors_list
    ...
FastAPI endpoint:
@holy_api.get("/actors", response_class=PrettyJSONResponse)
def get_actors(detailed: bool = False, db: Session = Depends(get_db)):
    """Get a list of all the actors from the show with their total lines, optionally detailed view"""
    return crud.get_actors(db, detailed=detailed)
Now, when I open up /actors, the terminal prints the debug output:
ACTORS LIST: [('Michael Palin', 2454), ('Eric Idle', 2107), ('John Cleese', 2044), ('Graham Chapman', 1848), ('Terry Jones', 1801), ('Carol Cleveland', 277), ('Terry Gilliam', 85), ('Terry\nJones', 35), ('Neil Innes', 12), ('Ian Davidson', 8), ('Connie Booth', 5), ('Katya Wyeth', 4), ('Rita Davies', 3), ('Marjorie Wilde', 3), ('Donna Reading', 2), ('Nicki Howorth', 1), ('Julia Breck', 1), ('Caron Gardener', 1)]
INFO: 127.0.0.1:54057 - "GET /actors HTTP/1.1" 200 OK
This is exactly what I need. But the actual JSON response looks like:
[
    {
        "actor": "Michael Palin"
    },
    {
        "actor": "Eric Idle"
    },
    {
        "actor": "John Cleese"
    },
    {
        "actor": "Graham Chapman"
    },
    {
        "actor": "Terry Jones"
    },
    {
        "actor": "Carol Cleveland"
    },
    ...
]
The total lines aren't shown next to the actor. Why? I am not sure if this is a problem with FastAPI not serializing the rows properly, a SQLAlchemy thing, or me doing something plainly wrong.
Wow! Shortly after posting this question, I fixed it. All I had to do was attach a label to the aggregate count function: func.count(models.Script.detail).label("total_lines"). I've been on this problem for hours and I feel really, really dumb right now.
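For reference, a sketch of the corrected query (identical to the original except for the .label(...) call):
query = (
    db.query(
        models.Script.actor,
        func.count(models.Script.detail).label("total_lines"),  # naming the aggregate lets it appear in the response
    )
    .filter(models.Script.actor.isnot(None))
    .group_by(models.Script.actor)
    .order_by(func.count(models.Script.detail).desc())
)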

How to read dynamic/changing keys from JSON in Python

I am using the UK Bus API to collect bus arrival times etc.
In Python 3 I have been using:
import http.client
import json
from types import SimpleNamespace as Namespace

try:
    connection = http.client.HTTPSConnection("transportapi.com")
    connection.request("GET", "/v3/uk/bus/stop/xxxxxxxxx/live.json?app_id=xxxxxxxxxxxxxxx&app_key=xxxxxxxxxxxxxxxxxxxxxxxxxxx&group=route&nextbuses=yes")
    res = connection.getresponse()
    data = res.read()
finally:
    connection.close()

x = json.loads(data, object_hook=lambda d: Namespace(**d))
print("Stop Name : " + x.stop_name)
This is all reasonably simple; however, the JSON data returned looks like this:
{
    "atcocode": "xxxxxxxx",
    "smscode": "xxxxxxxx",
    "request_time": "2020-03-10T15:42:22+00:00",
    "name": "Hospital",
    "stop_name": "Hospital",
    "bearing": "SE",
    "indicator": "adj",
    "locality": "Here",
    "location": {
        "type": "Point",
        "coordinates": [
            -1.xxxxx,
            50.xxxxx
        ]
    },
    "departures": {
        "8": [
            {
                "mode": "bus",
                "line": "8",
                "line_name": "8",
                "direction": "North",
                "operator": "CBLE",
                "date": "2020-03-10",
                ...
Under "departures" the key name changes due to the bus number / code.
Using Python 3 how do I extract the key name and all subsequent values below/within it?
Many thanks for any help!
You can do this (assuming x is loaded as a plain dict, i.e. x = json.loads(data) without the object_hook, so it supports item access):
for k, v in x["departures"].items():
    print(k, v)  # or whatever else you wanted to do.
Which returns:
8 [{'mode': 'bus', 'line': '8', 'line_name': '8', 'direction': 'North', 'operator': 'CBLE', 'date': '2020-03-10'}]
So k is equal to 8 and v is the value.
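Building on that, a short sketch that walks every departures group and pulls a few fields from each entry (the field names come from the sample JSON above):
departures = json.loads(data)["departures"]  # plain dict, no object_hook
for route, buses in departures.items():
    for bus in buses:
        print(route, bus["line_name"], bus["direction"], bus["date"])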

How to take any CSV file and convert it to JSON? (with Python as a script engine) [Novice user trying to learn NiFi]

1) There is a CSV file containing the following information (the first row is the header):
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
[
    {
        "first": 1,
        "second": 4,
        "third": 9,
        "total": 14
    },
    {
        "first": 7,
        "second": 5,
        "third": 2,
        "total": 14
    },
    {
        "first": 3,
        "second": 8,
        "third": 7,
        "total": 18
    }
]
But it does not work, and I am not sure how to fix this. Can anyone help me understand how to approach this problem?
NiFi flow: (screenshot not included)
Although I'm not into Python, just by googling around I think this might do it:
import csv
import json

with open("YOURFILE.csv") as f:
    reader = csv.DictReader(f)
    data = [r for r in reader]

with open('result.json', 'w') as outfile:
    json.dump(data, outfile)
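One caveat: csv.DictReader yields every value as a string, while the desired output contains numbers. A hedged variant (assuming the header names shown above) that casts the values to int and recomputes the total in Python:
import csv
import json

# Cast every field to int (the sample CSV is all integers).
with open("YOURFILE.csv") as f:
    rows = [{k: int(v) for k, v in row.items()} for row in csv.DictReader(f)]

# Recompute (or add) the total column from the three value columns.
for row in rows:
    row["total"] = row["first"] + row["second"] + row["third"]

with open("result.json", "w") as outfile:
    json.dump(rows, outfile, indent=4)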
You can use the QueryRecord processor and add a new property named total:
select first, second, third, first + second + third as total from FLOWFILE
Configure the CSVReader controller service with a matching Avro schema that uses int as the datatype for all the fields, and configure a JsonRecordSetWriter controller service. Include the total field name so that the output from the QueryRecord processor contains all the columns plus the sum of the columns as total.
Connect the total relationship from the QueryRecord processor for further processing.
Refer to the NiFi documentation on the QueryRecord processor and on configuring record readers/writers.

Which R object returns this JSON structure?

After R to JSON conversion, this should be the output:
{
    "alpha": [100, 120, 140, 150, 160],
    "beta": [0.6, 1, 1.5, 2],
    "gamma": [
        [
            0.018429082998491217,
            -0.1973461380810494,
            0.6373366343601572,
            0.1533790888325718,
            0.014712015654254968
        ],
        [
            0.012075950866910893,
            -0.14585424179257,
            0.6591589092698342,
            0.2571689477155383,
            0.010925520086793088
        ],
        [
            0.0159193430322232,
            -0.146917626129837,
            0.4710901890006199,
            0.15728143658310957,
            0.012566273548505473
        ],
        [
            0.017317835334994967,
            -0.1549043092753231,
            0.4882454969264185,
            0.1300951912298256,
            0.013437976685378085
        ]
    ]
}
This describes a matrix: alpha and beta are arrays that index, by columns, the matrix gamma shown above.
The rjson::toJSON() function takes a vector or a list. However, it doesn't split an R matrix (with named rows and columns) into separate arrays; instead, it generates an array of row values and names each column by its column name.
I cannot figure out which R data structure would produce such a format.
Could you show me the R code that uses rjson::toJSON() function and generates that output?