How to expand nested JSON into a Spark dataframe on AWS Glue

I am working with the following marketing JSON file:
{
  "request_id": "xx",
  "timeseries_stats": [
    {
      "timeseries_stat": {
        "id": "xx",
        "timeseries": [
          {
            "start_time": "xx",
            "end_time": "xx",
            "stats": {
              "impressions": xx,
              "swipes": xx,
              "view_completion": xx,
              "spend": xx
            }
          },
          {
            "start_time": "xx",
            "end_time": "xx",
            "stats": {
              "impressions": xx,
              "swipes": xx,
              "view_completion": xx,
              "spend": xx
            }
          }
        ]
      }
    }
  ]
}
I can parse this using pandas very easily and obtain the desired dataframe in the format
start_time end_time impressions swipes view_completion spend
xx xx xx xx xx xx
xx xx xx xx xx xx
but I need to do it in Spark on AWS Glue.
After creating an initial Spark dataframe (df) using
rdd = sc.parallelize(JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries'])
df = rdd.toDF()
I tried expanding the stats key as follows
df_expanded = df.select("start_time","end_time","stats.*")
Error:
AnalysisException: 'Can only star expand struct data types.
Attribute: `ArrayBuffer(stats)`;'
and, trying explode instead:
from pyspark.sql.functions import explode
df_expanded = df.select("start_time","end_time").withColumn("stats", explode(df.stats))
Error:
AnalysisException: 'The number of aliases supplied in the AS clause does not match the
number of columns output by the UDTF expected 2 aliases but got stats ;
I'm pretty new to Spark, so any help with either of the two approaches would be much appreciated!
It's a pretty similar problem to the one in:
parse array of dictionaries from JSON with Spark
except that I need to flatten this additional stats key.

When you explode a map column, it produces two columns (key and value), so .withColumn does not work here. Use explode inside a select statement instead:
from pyspark.sql import functions as f
df.select('start_time', 'end_time', f.explode('stats')) \
.groupBy('start_time', 'end_time').pivot('key').agg(f.first('value')).show()
+----------+--------+-----------+-----+------+---------------+
|start_time|end_time|impressions|spend|swipes|view_completion|
+----------+--------+-----------+-----+------+---------------+
| yy| yy| yy| yy| yy| yy|
| xx| xx| xx| xx| xx| xx|
+----------+--------+-----------+-----+------+---------------+
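The map-versus-struct distinction also explains the first error: building the DataFrame from an RDD of Python dicts makes Spark typically infer stats as a map, which cannot be star-expanded. If Spark instead infers the schema from the JSON text itself, stats usually comes back as a struct and the original stats.* attempt works directly. A minimal sketch, assuming JSON_resp is the parsed response from the question:
import json
# Serialize each timeseries entry back to JSON text and let spark.read.json
# infer the schema, so "stats" becomes a struct rather than a map.
timeseries = JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries']
df = spark.read.json(sc.parallelize([json.dumps(t) for t in timeseries]))
# Star expansion now works because stats is a StructType.
df_expanded = df.select("start_time", "end_time", "stats.*")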

Related

pandas json object read error when using flask request get_json

I am passing a JSON object through a POST API into a flask app. The goal is to convert it to a single row pandas DF and pass it on for further processing.
The JSON payload is as follows:
{
  "ABC": "123",
  "DATE": "2020-01-01",
  "AMOUNT": "100",
  "IDENTIFIER": "12345"
}
The output of data=flask.request.get_json() and print(data) is
{'ABC': '123', 'DATE': '2020-01-01', 'AMOUNT': '100','IDENTIFIER': '12345'}
But when I do pd.read_json(data) on it, I get an error:
ValueError: Invalid file path or buffer object type: <class 'dict'>
Any ideas on how to handle this? I need the output to be
ABC DATE AMOUNT IDENTIFIER
123 2020-01-01 100 12345
Thanks!
Try this:
import pandas as pd
df = pd.DataFrame([data.values()], columns=data.keys())
print(df)
Output:
ABC DATE AMOUNT IDENTIFIER
0 123 2020-01-01 100 12345
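On a reasonably recent pandas (1.0+), pd.json_normalize produces the same single-row frame directly from the dict; a minimal sketch, assuming data is the dict returned by flask.request.get_json():
import pandas as pd

df = pd.json_normalize(data)   # one row, one column per top-level key
print(df)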

How to explode structs with pyspark explode()

How do I convert the following JSON into the relational rows that follow it? The part that I am stuck on is the fact that the pyspark explode() function throws an exception due to a type mismatch. I have not found a way to coerce the data into a suitable format so that I can create rows out of each object within the source key within the sample_json object.
JSON INPUT
sample_json = """
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}"""
DESIRED OUTPUT
dc_id source_name id description
-------------------------------------------------------------------------------
dc-101    sensor-igauge    10    Sensor attached to the container ceilings
dc-101 sensor-ipad 13 Sensor ipad attached to carbon cylinders
dc-101 sensor-inest 8 Sensor attached to the factory ceilings
dc-101 sensor-istick 5 Sensor embedded in exhaust pipes in the ceilings
PYSPARK CODE
from pyspark.sql.functions import *
df_sample_data = spark.read.json(sc.parallelize([sample_json]))
df_expanded = df_sample_data.withColumn("one_source",explode_outer(col("source")))
display(df_expanded)
ERROR
AnalysisException: cannot resolve 'explode(source)' due to data type
mismatch: input to function explode should be array or map type, not
struct....
I put together this Databricks notebook to further demonstrate the challenge and clearly show the error. I will be able to use this notebook to test any recommendations provided herein.
You can't use explode on structs, but you can get the column names in the struct source (with df.select("source.*").columns), use a list comprehension to build an array of structs holding the fields you want from each nested struct, and then explode that array to get the desired result:
from pyspark.sql import functions as F
df1 = df.select(
    "dc_id",
    F.explode(
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.description").alias("description")
            )
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")
df1.show(truncate=False)
#+------+-------------+---+------------------------------------------------+
#|dc_id |source_name |id |description |
#+------+-------------+---+------------------------------------------------+
#|dc-101|sensor-igauge|10 |Sensor attached to the container ceilings |
#|dc-101|sensor-inest |8 |Sensor attached to the factory ceilings |
#|dc-101|sensor-ipad |13 |Sensor ipad attached to carbon cylinders |
#|dc-101|sensor-istick|5 |Sensor embedded in exhaust pipes in the ceilings|
#+------+-------------+---+------------------------------------------------+
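The same pattern extends to other leaf fields. For example, here is a sketch that also pulls temp and the nested geo coordinates for each sensor (the column names are taken from the sample JSON above; adjust them to your real schema):
from pyspark.sql import functions as F

df2 = df.select(
    "dc_id",
    F.explode(
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.temp").alias("temp"),
                F.col(f"source.{s}.geo.lat").alias("lat"),
                F.col(f"source.{s}.geo.long").alias("long")
            )
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")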

reshape jq nested file and make csv

I've been struggling with this JSON file for a whole day; I want to turn it into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the JSON to contain just 2 objects inside "items".
What I would like to get is a csv like this
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are two extra complications: there are two types of "officers", some are people and some are companies, so not all keys present in one type are present in the other, and vice versa. I'd like these missing entries to be 'null'. The second complication is the nested values like "name", which contains a comma, or "address", which contains several sub-objects (which I guess I could flatten in pandas, though).
{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}
What I've been doing is creating new JSON objects, extracting the fields I need, like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours - and I am sure there is a quicker way.
Right now I am trying a new approach: creating a JSON object whose root is the company number, with the list of its officers as the value.
{(.links.self | split("/")[2]): .items[]}
Using jq, it's easier to first extract the values from the top-level object that will be shared, and then generate the desired rows. You'll want to limit the number of times you go through the items to at most once.
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| @csv
' input.json
OK, so you want to scan the list of officers, extract some fields from each one if they are present, and write them out in CSV format.
The first part is extracting the data from the JSON. Assuming you have loaded it into a Python object named data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')
                         ))
The get method on a dictionary allows you to use a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
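The nested address object mentioned in the question can be flattened in the same loop with a second .get(); here is a sketch extending the snippet above (the address columns chosen are just an example):
with open('output_with_address.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        address = officer.get('address', {})
        wr.writerow(('OC418979',
                     officer.get('country_of_residence', ''),
                     officer.get('officer_role', ''),
                     officer.get('appointed_on', ''),
                     address.get('premises', ''),
                     address.get('address_line_1', ''),
                     address.get('locality', ''),
                     address.get('postal_code', '')))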

Count unique values in objects within large JSON file with Python

I have some rather large JSON files. Each contains thousands of objects within one (1) array. The JSONs are structured in the following format:
{
  "alert": [
    {
      "field1": "abc",
      "field2": "def",
      "field3": "xyz"
    },
    {
      "field1": null,
      "field2": null,
      "field3": "xyz"
    },
    ...
    ...
  ]
}
What's the most efficient way to use Python and the json library to search through a JSON file, find the unique values in each object within the array, and count how many times they appear? E.g., search the array's "field3" objects for the value "xyz" and count how many times it appears. I tried a few variations based on existing solutions in StackOverflow, but they are not providing the results I'm looking for.
A quick search on PyPI turned up
ijson 2.3 - Iterative JSON parser with a standard Python iterator interface
https://pypi.python.org/pypi/ijson
Here's an example which should work for your data:
import ijson
import json

counts = {}
with open("data.json") as f:
    objects = ijson.items(f, 'alert.item')
    for o in objects:
        for k, v in o.items():
            field = counts.get(k, {})
            total = field.get(v, 0)
            field[v] = total + 1
            counts[k] = field

print(json.dumps(counts, indent=2))
Running this with your sample data in data.json produces:
{
  "field2": {
    "null": 1,
    "def": 1
  },
  "field3": {
    "xyz": 2
  },
  "field1": {
    "null": 1,
    "abc": 1
  }
}
Note however that the null in your input was transformed into the string "null".
As a point of comparison, here is a jq command which produces an equivalent result using tostream
jq -M '
reduce (tostream|select(length==2)) as [$p,$v] (
{}
; ($p[2:]+[$v|tostring]) as $k
| setpath($k; getpath($k)+1)
)
' data.json
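If the files fit comfortably in memory, the standard json module plus collections.Counter is a simpler way to count one field's values, at the cost of loading the whole file at once; a minimal sketch against the same data.json:
import json
from collections import Counter

with open('data.json') as f:
    data = json.load(f)

# Count how often each value of "field3" appears across the alert array.
field3_counts = Counter(item.get('field3') for item in data['alert'])
print(field3_counts)   # e.g. Counter({'xyz': 2})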

Creating an aggregate metrics from JSON logs in apache spark

I am getting started with Apache Spark.
I have a requirement to convert a JSON log into flattened metrics; it can be considered a simple CSV as well.
For example:
{
  "orderId": 1,
  "orderData": {
    "customerId": 123,
    "orders": [
      {
        "itemCount": 2,
        "items": [
          {
            "quantity": 1,
            "price": 315
          },
          {
            "quantity": 2,
            "price": 300
          }
        ]
      }
    ]
  }
}
This can be considered a single JSON log; I want to convert it into:
orderId,customerId,totalValue,units
1 , 123 , 915 , 3
I was going through the Spark SQL documentation and can use it to get hold of individual values like "select orderId, orderData.customerId from Order", but I am not sure how to get the summation of all the prices and units.
What would be the best practice to get this done using Apache Spark?
Try:
>>> from pyspark.sql.functions import *
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> df = sqlContext.read.json(sc.parallelize([doc]))
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
... .withColumn("item", explode("order.items")) \
... .groupBy("orderId", "customerId") \
... .agg(sum("item.quantity"), sum(col("item.quantity") * col("item.price")))
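To line the result up with the requested column names, the two aggregates can be aliased and displayed; a minimal sketch building on the answer above:
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderId", "customerId") \
...     .agg(sum(col("item.quantity") * col("item.price")).alias("totalValue"),
...          sum("item.quantity").alias("units")) \
...     .show()
For the sample log this should give totalValue = 915 and units = 3, matching the desired output above.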
For people who are looking for a Java solution to the above:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

SparkSession spark = SparkSession
        .builder()
        .config(conf)
        .getOrCreate();

Dataset<Row> orders = spark.read().json("order.json");
Dataset<Row> newOrders = orders.select(
                col("orderId"),
                col("orderData.customerId"),
                explode(col("orderData.orders")).alias("order"))
        .withColumn("item", explode(col("order.items")))
        .groupBy(col("orderId"), col("customerId"))
        // units = sum(quantity), totalValue = sum(quantity * price)
        .agg(sum(col("item.quantity")), sum(col("item.quantity").multiply(col("item.price"))));
newOrders.show();