Using Pandas to Flatten a JSON with a nested array

I have the following JSON. I want to pull out task, flatten it, put it into its own dataframe, and include the id from the parent.
[
    {
        "id": 123456,
        "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "task": [
            {
                "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "taskId": 898989,
                "status": "Closed"
            },
            {
                "assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
                "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
                "taskId": 999999
            }
        ],
        "state": "Complete"
    },
    {
        "id": 123477,
        "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
        "resolvedBy": {"id": null, "firstName": null, "lastName": null},
        "task": [],
        "state": "Inprogress"
    }
]
I would like to get a dataframe from the tasks like so:
id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status
I have flattened the entire dataframe using
df = pd.json_normalize(json.loads(df.to_json(orient='records')))
It left task in [{}], which I think is okay, because I want to pull the tasks out into their own dataframe and include the id from the parent.
I have id and task in a dataframe like so:
tasksdf = storiesdf[['task', 'id']]
Then I want to normalize it like:
tasksdf = pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))
but I know that since the tasks are in an array I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.

The main problem is that the task array is empty in some cases, so those records won't appear in your dataframe if you create it with json_normalize.
Secondly, the same assignee and resolvedBy columns appear both at the parent level and inside the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:
import json
import pandas as pd

json_data = json.loads(json_str)  # json_str holds the JSON shown above

# One row per (story, task) pair; an empty task list becomes a single NaN row
df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')

# Unpack the parent-level assignee and resolvedBy dicts into flat columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)

df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

# Normalize the tasks separately, keeping the parent id and state as meta columns
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])

# Outer merge keeps parents with no tasks, and tasks whose assignee differs from the parent's
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)
print(df.drop_duplicates().reset_index(drop=True))
Output:
         id       state  assignee.id assignee.firstName  ... resolvedBy.firstName resolvedBy.lastName    taskId  status
0  123456.0    Complete         5757                Jim  ...                  Jim             Johnson  898989.0  Closed
1  123477.0  Inprogress         8576               Jack  ...                 None                None       NaN     NaN
2    123456    Complete         5857               Nacy  ...               George             Johnson  999999.0     NaN
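If all you need is the tasks in their own dataframe together with the parent id, as asked, here is a minimal sketch using json_normalize directly; the parent. prefix is my own choice to keep the parent id distinct from the task-level columns:

import json
import pandas as pd

json_data = json.loads(json_str)  # json_str holds the JSON shown above
tasks_df = pd.json_normalize(json_data, record_path='task',
                             meta=['id'], meta_prefix='parent.')
print(tasks_df)

Keep in mind the caveat above: parents whose task list is empty (id 123477 here) produce no rows at all.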

Related

Most efficient way to read concatenated json in PySpark?

I'm getting a JSON file where each line is itself a complete JSON document, about 1000 of them concatenated in one file, like this:
{"id":"test1", "results": [{"property1": "sample1"},{"property2": "sample2"}]}
{"id":"test2", "results": [{"property1": "sample3"},{"property2": "sample4"}]}
If I read it as a json using spark.read.json(filepath), I'm getting:
+-----+--------------------+
| id| results|
+-----+--------------------+
|test1|[{sample1, null},...|
+-----+--------------------+
(which is only the first JSON document in the concatenated file)
While I'm trying to get:
+-----+---------+---------+
|id |property1|property2|
+-----+---------+---------+
|test1|sample1 |sample2 |
|test2|sample3 |sample4 |
+-----+---------+---------+
I ended up reading the JSON as text and iterating over each row to treat it as JSON, unioning each dataframe:
df = spark.read.text(data[self.files])
dataCollect = df.collect()
i = 0
for row in dataCollect:
    df_row = flatten_json(spark.read.json(spark.sparkContext.parallelize(row)))
    if i == 0:
        df_all = df_row
    else:
        df_all = df_row.unionByName(df_all, allowMissingColumns=True)
    i = i + 1
flatten_json is a helper that automatically flattens the JSON for me.
I guess there is a better approach; any help would be much appreciated.
Your JSON file is in the JSON Lines (JSONL) format, which PySpark can read natively. So use the regular spark.read.json to read it and perform the additional transformations to match what you want.
from pyspark.sql import functions as F

df = spark.read.json('yourfile.json or json/directory')

# Explode the array into structs. This will generate lots of nulls.
df = (df.select('id', F.explode('results').alias('results'))
        .select('id', 'results.*'))

# Group them and aggregate to remove the nulls.
df = (df.groupby('id')
        .agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id']))
I think this works fine for a 1000-line JSONL file; however, if you are curious about an alternative that doesn't involve generating and removing nulls, please check here: By using PySpark how to parse nested json. In some situations the alternative, which avoids explode, can be more performant.
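To illustrate the no-explode idea on this particular shape (assuming, and this is my assumption rather than something stated in the question, that results always holds exactly two single-key structs in a fixed order):

from pyspark.sql import functions as F

df = spark.read.json('yourfile.json')
# Pick each property straight out of its fixed array slot
df = df.select(
    'id',
    F.col('results')[0]['property1'].alias('property1'),
    F.col('results')[1]['property2'].alias('property2'),
)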

Postgres json to view

I have a table like this (with a jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this JSON with Python into a dataframe:
import pandas as pd
import json

data = {
    "id": [1, 2],
    "myyear": [2016, 2017],
    "value": [5, 9]
}
data = json.dumps(data)
df = pd.read_json(data)
print(df)
I get this result:
   id  myyear  value
0   1    2016      5
1   2    2017      9
How can a get this result directly from the json column via sql in a postgres view?
Note: this assumes that your id, my_year, and value arrays are consistent and have the same length.
This answer uses PostgreSQL's jsonb_array_elements_text function to explode array elements into rows.
select jsonb_array_elements_text(payload -> 'id') as "id",
jsonb_array_elements_text(payload -> 'bv_year') as "myyear",
jsonb_array_elements_text(payload -> 'value') as "value"
from main
And this gives the output below:
id   myyear  value
1    2016    5
2    2017    9
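If you also want to drive this from Python, here is a sketch that wraps the query above in a view and reads it back (the table main, the payload column, and the bv_year key come from the fiddle and the query above; the connection string is a placeholder):

import psycopg2

ddl = """
create or replace view main_flat as
select jsonb_array_elements_text(payload -> 'id')      as id,
       jsonb_array_elements_text(payload -> 'bv_year') as myyear,
       jsonb_array_elements_text(payload -> 'value')   as value
from main
"""

conn = psycopg2.connect("dbname=yourdb")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(ddl)
    cur.execute("select * from main_flat")
    for row in cur.fetchall():
        print(row)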
That said, storing parallel arrays in a jsonb object is not the best design and could lead to data inconsistencies later. If it's within your control, I would suggest storing the data so that each property's mapping is explicit. Some suggestions:
You can instead have separate columns for each property.
If you want to store it as jsonb only, then consider [{"id": "", "year": "", "value": ""}].

How to add field within nested JSON when reading from/writing to Kafka via a Spark dataframe

I have a Spark (v3.0.1) job written in Java that reads JSON from Kafka, does some transformation, and then writes it back to Kafka. For now, the incoming message structure in Kafka is something like {"catKey": 1}, and the output from the Spark job that's written back to Kafka is something like {"catKey":1,"catVal":"category-1"}. The code for processing input data from Kafka goes as follows:
DataFrameReader dfr = putSrcProps(spark.read().format("kafka"));
for (String key : srcProps.stringPropertyNames()) {
    dfr = dfr.option(key, srcProps.getProperty(key));
}
Dataset<Row> df = dfr.option("group.id", getConsumerGroupId())
        .load()
        .selectExpr("CAST(value AS STRING) as value")
        .withColumn("jsonData", from_json(col("value"), schemaHandler.getSchema()))
        .select("jsonData.*");
// transform df
df.toJSON().write().format("kafka").option("key", "val").save();
I want to change the message structure in Kafka. Now it should be of the format {"metadata": <whatever>, "payload": {"catKey": 1}}. While reading, we need to read only the contents of payload, so the dataframe stays the same. While writing back to Kafka, I first need to wrap the message in payload and add the metadata, so the output has to be of the format {"metadata": <whatever>, "payload": {"catKey":1,"catVal":"category-1"}}. I've tried manipulating the contents of selectExpr and from_json, but no luck so far. Any pointers on how to achieve this would be much appreciated.
To extract the content of payload in your JSON you can use get_json_object. And to create the new output you can use the built-in functions struct and to_json.
Given a Dataframe:
val df = Seq(("""{"metadata": "whatever", "payload": {"catKey": 1}}""")).toDF("value").as[String]
df.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|{"metadata": "whatever", "payload": {"catKey": 1}}|
+--------------------------------------------------+
Then build the payload and metadata columns:
val df2 = df
  .withColumn("catVal", lit("category-1")) // whatever your logic is to fill this column
  .withColumn("payload",
    struct(
      get_json_object(col("value"), "$.payload.catKey").as("catKey"),
      col("catVal").as("catVal")
    )
  )
  .withColumn("metadata", get_json_object(col("value"), "$.metadata"))
  .select("metadata", "payload")
df2.show(false)
+--------+---------------+
|metadata|payload |
+--------+---------------+
|whatever|[1, category-1]|
+--------+---------------+
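Finally, serialize metadata and payload back into a single JSON string column called value: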
val df3 = df2.select(to_json(struct(col("metadata"), col("payload"))).as("value"))
df3.show(false)
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"metadata":"whatever","payload":{"catKey":"1","catVal":"category-1"}}|
+----------------------------------------------------------------------+
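For reference, a PySpark sketch of the same round trip using an explicit schema with from_json instead of get_json_object (the envelope schema, the metadata literal, and the input dataframe raw_df are assumptions for illustration, not from the question):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Assumed envelope: {"metadata": <string>, "payload": {"catKey": <long>}}
envelope = StructType([
    StructField("metadata", StringType()),
    StructField("payload", StructType([StructField("catKey", LongType())])),
])

# Read: parse the envelope and keep only the payload contents,
# so the rest of the job still sees the old flat structure
parsed = (raw_df  # assumed: a dataframe with a string column "value" from Kafka
          .withColumn("jsonData", F.from_json("value", envelope))
          .select("jsonData.payload.*"))

transformed = parsed.withColumn("catVal", F.lit("category-1"))

# Write: wrap the row back into the envelope before sending to Kafka
result = transformed.select(F.to_json(F.struct(
    F.lit("whatever").alias("metadata"),  # assumed metadata value
    F.struct("catKey", "catVal").alias("payload"),
)).alias("value"))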

How to convert dataframe output to json format and then Normalize the data?

I am running a SQL query and reading the output as a pandas df. Now I need to convert the data to JSON and normalize it. I tried to_json, but this gives only a partial solution.
Dataframe output:
| SalesPerson | ContactID |
| 12345       | Tom       |
| 12345       | Robin     |
| 12345       | Julie     |
Expected JSON:
{"SalesPerson": "12345", "ContactID":"Tom","Robin","Julie"}
Please see below the code which I tried:
q = "SELECT COL1, SalesPerson, ContactID FROM table"
df = pd.read_sql(q, sqlconn)
df1 = df.iloc[:, 1:2]
df2 = df1.to_json(orient='records')
Also, to_json produces surrounding brackets, which I don't need either.
Try this:
df.groupby('SalesPerson').apply(lambda x: pd.Series({
    'ContactID': x['ContactID'].values
})).reset_index().to_json(orient='records')
Output (pretty printed):
[
    {
        "SalesPerson": 1,
        "ContactID": ["Tom", "Robin", "Julie"]
    },
    {
        "SalesPerson": 2,
        "ContactID": ["Jack", "Mike", "Mary"]
    }
]
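A slightly shorter equivalent without apply, aggregating straight to lists (a sketch against the question's sample data):

import pandas as pd

df = pd.DataFrame({
    'SalesPerson': ['12345', '12345', '12345'],
    'ContactID': ['Tom', 'Robin', 'Julie'],
})

out = (df.groupby('SalesPerson', as_index=False)
         .agg({'ContactID': list})
         .to_json(orient='records'))
print(out)  # [{"SalesPerson":"12345","ContactID":["Tom","Robin","Julie"]}]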

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models and I figured I'd do it with something I'm familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The datasets I'm using come from Kaggle. You can grab them and run this yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print(df)
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the corresponding name from the JSON.
If both of these were dataframes, I could use merge.
This is what I've tried. I don't know if this is the best way to read in the JSON file:
import pandas

df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary and then map it onto df['t1_champ1id'].
import json

# Load the champion info and build an id -> name lookup
with open('champion_info.json') as f:
    json_file = json.load(f)
player_names = {v['id']: v['name'] for v in json_file['data'].values()}

df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
#    gameDuration t1_champ1id
# 0          1949       Annie
# 1          1851        Olaf
# 2          1493       Annie
# 3          1758       Annie
# 4          2094        Olaf
The approach below creates a dataframe from the 'data' in the JSON file (transposing the result and setting the index to the field you want to map on, id), then maps that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])
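If your dataframe has several champion-id columns, as the Kaggle dataset does (the exact column-name pattern below is an assumption), the same lookup dictionary from the first answer can be mapped over all of them:

# Hypothetical pattern: every column that looks like a champion-id column
champ_cols = [c for c in df.columns if 'champ' in c and c.endswith('id')]
for c in champ_cols:
    df[c] = df[c].map(player_names)  # player_names as built in the first answer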