Convert multi-nested JSON to Pandas DataFrame

{
    "ABC": {
        "A": {
            "Date": "01/01/2021",
            "Value": "0.09"
        },
        "B": {
            "Date": "01/01/2021",
            "Value": "0.001"
        }
    },
    "XYZ": {
        "A": {
            "Date": "01/01/2021",
            "Value": "0.006"
        },
        "B": {
            "Date": "01/01/2021",
            "Value": "0.000"
        }
    }
}
The current output after applying pd.json_normalize(x, max_level=1) still leaves the nested Date/Value dicts packed inside the columns. The expected output is one row per top-level key (ABC, XYZ), with a Date column and one column per second-level key (A, B). I need to convert this to a pandas DataFrame. If anyone can help or give some advice on working with this data, that would be great!

Use the following, where js is your input dict:
import pandas as pd

s = pd.DataFrame(js)
ss = s.apply(lambda x: [pd.Series(y)['Value'] for y in x])
ss['Date'] = s['ABC'].apply(pd.Series)['Date']
The result should look like this (note that the Value strings are not converted to numbers):
     ABC    XYZ        Date
A   0.09  0.006  01/01/2021
B  0.001  0.000  01/01/2021

One possible option is custom processing of your x object,
creating a list of rows:
lst = []
for k1, v1 in x.items():
    row = {}
    row['key'] = k1
    for k2, v2 in v1.items():
        dd = v2["Date"]
        vv = float(v2["Value"])
        row['Date'] = dd
        row[k2] = vv
    lst.append(row)
Note that the above code also converts Value to float type.
I assumed that all dates within each first-level object are the same,
so Date being overwritten in the second-level loop does no harm.
Then you can create the output DataFrame as follows:
df = pd.DataFrame(lst)
df.set_index('key', inplace=True)
df.index.name = None
The result is:
           Date      A      B
ABC  01/01/2021  0.090  0.001
XYZ  01/01/2021  0.006  0.000
Although it is possible to read x into a temporary DataFrame using json_normalize,
the sequence of operations needed to convert it to your desired shape would be complicated.
This is why I came up with the above solution, which is in my opinion conceptually simpler.
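For completeness, here is a sketch of what that json_normalize route could look like (assuming x is the parsed dict above); it reaches the same result, but the reshaping steps are harder to follow:

import pandas as pd

# flatten to a single row with columns like 'ABC.A.Date', 'ABC.A.Value'
tmp = pd.json_normalize(x)
tmp.columns = tmp.columns.str.split('.', expand=True)   # MultiIndex (key, sub, field)

s = tmp.iloc[0]                                      # Series with a 3-level index
pairs = s.unstack()                                  # index (key, sub); columns Date, Value
df = pairs['Value'].astype(float).unstack()          # A and B become columns
df['Date'] = pairs['Date'].groupby(level=0).first()  # one Date per first-level key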


Combine pandas dataframe and additional value into a JSON file

I have created a function that outputs a JSON file; its inputs are a pandas DataFrame and a variable.
The DataFrame (df) has 3 columns: 'a', 'b', and 'c'.
df = pd.DataFrame([[1, 2, 3], [0.1, 0.2, 0.3]], columns=('a', 'b', 'c'))
Error is a variable holding a numeric value.
Error = 45
The output format of the JSON file should look like this:
{
    "Error": 45,
    "Data": [
        {
            "a": [1, 0.1]
        },
        {
            "b": [2, 0.2]
        },
        {
            "c": [3, 0.3]
        }
    ]
}
I can convert the dataframe into a JSON using the below code. But how can I obtain the desired format of the JSON file?
def OutputFunction(df, Error):
    #json_output = df_ViolationSummary.to_json(orient = 'records')
    df.to_json(r'C:\Users\Administrator\Downloads\export_dataframe.json', orient='records')

## Calling the Function
OutputFunction(df, Error)
Calling to_dict(orient='list') returns a dictionary whose keys are the columns and whose values are the column values as lists. You can then build your desired JSON object like this: output = {"Error": Error, "Data": df.to_dict(orient='list')}.
Running this line will return:
{'Error': 45, 'Data': {'a': [1.0, 0.1], 'b': [2.0, 0.2], 'c': [3.0, 0.3]}}
Note that the integer values will appear as floats, since some of the values in the DataFrame are floats and so the columns' dtypes become float. If you really wish to have mixed types, you could use a dictionary comprehension such as the following, although it should not be necessary in most cases:
output = {
    "Error": Error,
    "Data": {
        col: [
            int(v) if v % 1 == 0 else v
            for v in vals
        ]
        for col, vals in df.to_dict(orient='list').items()
    }
}
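Note that the question asks for "Data" as a list of single-key objects rather than a single mapping. If you need that exact shape, a small comprehension over the same to_dict output gets there; this is a minimal sketch that writes to a relative path instead of the original Windows path:

import json
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [0.1, 0.2, 0.3]], columns=('a', 'b', 'c'))
Error = 45

output = {
    "Error": Error,
    # one {column: values} object per column, matching the desired format
    "Data": [{col: vals} for col, vals in df.to_dict(orient='list').items()],
}

with open('export_dataframe.json', 'w') as f:
    json.dump(output, f, indent=4)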

how to read a multiline nested json in spark scala [duplicate]

This question already has answers here: Read multiline JSON in Apache Spark (2 answers). Closed 2 years ago.
I have a JSON file as below:
[
    {
        "WHO": "Joe",
        "WEEK": [
            {
                "NUMBER": 3,
                "EXPENSE": [
                    {
                        "WHAT": "BEER",
                        "AMOUNT": 18.00
                    },
                    {
                        "WHAT": "Food",
                        "AMOUNT": 12.00
                    },
                    {
                        "WHAT": "Food",
                        "AMOUNT": 19.00
                    },
                    {
                        "WHAT": "Car",
                        "AMOUNT": 20.00
                    }
                ]
            }
        ]
    }
]
I executed the below code:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val jsonRDD = sc.wholeTextFiles("/test.json").map(x => x._2)
val jason = sqlContext.read.json(jsonRDD)
jason.show
The output shows WrappedArray in the WEEK column. How can we explode the data?
You don't need to read it with wholeTextFiles; you can read it as JSON directly. You just need to set the multiLine option to true to make it work.
val df = spark.read.option("multiLine", true).json("/test.json")
Now, to further explode the array columns, you can use selectExpr to see each element of the array as a column:
val df1 = df.selectExpr("WHO","Week.Expense[0].amount as Amount","Week.Expense[0].What as What","WEEK.Number as Number")
You can also use a combination of select plus explode to do the same thing:
val df2 = df.select($"WHO", explode($"Week").as("c1"))
  .select("WHO", "c1.Expense", "c1.Number", "c1.Expense.amount", "c1.Expense.what")
  .drop("Expense")

Add string literal into JSONPath output

Can I add a string literal to a JSONPath selector?
{
    "items": [
        { "x": 1 },
        { "x": 2 },
        { "x": 3 },
        { "x": 4 }
    ]
}
$.items[:].x gives...
[
1,
2,
3,
4
]
For example, can I make it return...
[
{ 1 },
{ 2 },
{ 3 },
{ 4 }
]
I want to generate some code that adds items to a dictionary.
As discussed in the comments, this cannot be done using JSONPath (alone), since a path query returns only valid JSON and the target format is not valid JSON. In general, JSONPath is not the right tool here; a JSON transformation using a library like Jolt would be more appropriate, but again, similar to XSLT transformations, it can only create valid output. So, as you already have found out, you would need to use string functions to massage the output as needed. For instance, a regex substitution could do:
const regex = /(\d+),?/gm;
const str = `[
1,
2,
3,
4
]`;
const subst = `{ $1 },`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
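For reference, the same substitution is a one-liner in Python with re.sub (a sketch; text stands in for the JSONPath output rendered as a string):

import re

text = """[
    1,
    2,
    3,
    4
]"""

# wrap each number in braces, mirroring the JavaScript regex above
result = re.sub(r"(\d+),?", r"{ \1 },", text)
print(result)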

Join nested JSON dataframe and another dataframe

I am trying to join dataframe1 (generated from the JSON) with dataframe2 on the field order_id, then assign the "status" from dataframe2 to the "status" of dataframe1. Does anyone know how to do this? Many thanks for your help.
dataframe1
[{
    "client_id": 1,
    "name": "Test01",
    "olist": [{
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""  <== use "status" from dataframe2 to populate this field
        },
        {
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
},
{
    "client_id": 2,
    "name": "Test02",
    "olist": [{
            "order_id": 10002,
            "order_dt_tm": "2012-12-01",
            "status": ""
        },
        {
            "order_id": 10003,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
}
]
dataframe2
order_id status
10002 "Delivered"
10001 "Ordered"
Here is your raw dataset as a json string:
d = """[{
    "client_id": 1,
    "name": "Test01",
    "olist": [{
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""
        },
        {
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
},
{
    "client_id": 2,
    "name": "Test02",
    "olist": [{
            "order_id": 10002,
            "order_dt_tm": "2012-12-01",
            "status": ""
        },
        {
            "order_id": 10003,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
}
]"""
Firstly, I would load it as json:
import json
data = json.loads(d)
Then I would turn it into a pandas DataFrame; notice that I drop the status field, since it will be populated by the join step:
df1 = pd.json_normalize(data, 'olist')[['order_id', 'order_dt_tm']]
Then, from the second dataframe sample, I would do a left join using the merge function:
data = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data)
result = df1.merge(df2, on='order_id', how='left')
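For reference, the merged result should look something like this (orders without a match in df2 get NaN status):

   order_id order_dt_tm     status
0     10000  2012-12-01        NaN
1     10000  2012-12-01        NaN
2     10002  2012-12-01  Delivered
3     10003  2012-12-01        NaN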
Good luck
UPDATE
# JSON to Dataframe
df1 = pd.json_normalize(data)
# Sub JSON to dataframe
df1['sub_df'] = df1['olist'].apply(lambda x: pd.json_normalize(x).drop('status', axis=1))
# Build second dataframe
data2 = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data2)
# Populates status in sub dataframes
df1['sub_df'] = df1['sub_df'].apply(lambda x: x.merge(df2, on='order_id', how='left').fillna(''))
# Sub dataframes back to JSON
def back_to_json_str(df):
    # turns a df back into a JSON string
    return str(df.to_json(orient="records", indent=4))
df1['olist'] = df1['sub_df'].apply(lambda x: back_to_json_str(x))
# Global DF back to JSON string
parsed = str(df1.drop('sub_df', axis=1).to_json(orient="records", indent=4))
parsed = parsed.replace(r'\n', '\n')
parsed = parsed.replace(r'\"', '\"')
# Print result
print(parsed)
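A sketch of an alternative to the replace calls at the end: parse the per-client olist JSON strings back into Python objects and serialize the whole structure once, which yields properly nested JSON without manual escape fixing:

import json

records = df1.drop('sub_df', axis=1).to_dict(orient='records')
for rec in records:
    # olist holds JSON strings at this point; parse them so the final dump nests properly
    rec['olist'] = json.loads(rec['olist'])
print(json.dumps(records, indent=4))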
UPDATE 2
Here is a way to add an index column to a dataframe:
df1['index'] = [e for e in range(df1.shape[0])]
This is my code for assigning title values from a dataframe back to the JSON object. The assignment loop takes quite a bit of time when the JSON object has 100000 records. Does anyone know how to improve the performance of this code? Many thanks.
import json
import random
import pandas as pd
import pydash as _
data = [{"pid":1,"name":"Test1","title":""},{"pid":2,"name":"Test2","title":""}] # 5000 records
# dataframe1
df = pd.json_normalize(data)
# dataframe2
pid = [x for x in range(1, 5000)]
title_set = ["Boss", "CEO", "CFO", "PMO", "Team Lead"]
titles = [title_set[random.randrange(0, 5)] for x in range(1, 5000)]
df2 = pd.DataFrame({'pid': pid, 'title': titles})
#left join dataframe1 and dataframe2
df3 = df.merge(df2, on='pid', how='left')
#assign title values from dataframe back to the json object
for row in df3.iterrows():
    idx = _.find_index(data, lambda x: x['pid'] == row[1]['pid'])
    data[idx]['title'] = row[1]['title_y']
print(data)
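One possible speedup (a sketch): the _.find_index call rescans the whole list for every row, which makes the loop quadratic. Building a pid-to-title dict once and assigning in a single pass avoids that, and the merge as well:

# O(n) overall: one dict build, one linear pass over the records
title_by_pid = dict(zip(df2['pid'], df2['title']))
for rec in data:
    rec['title'] = title_by_pid.get(rec['pid'], rec['title'])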

Julia | DataFrame conversion to JSON

I have a DataFrame in Julia, df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"]), which I have to convert into JSON like:
{
    "nodes": [
        {
            "A": "1",
            "B": "M"
        },
        {
            "A": "2",
            "B": "F"
        },
        {
            "A": "3",
            "B": "F"
        },
        {
            "A": "4",
            "B": "M"
        }
    ]
}
Please help me with this.
There isn't a method in DataFrames.jl to do this, but in a GitHub issue the following snippet, using JSON.jl, was offered as a way to write JSON:
using JSON
using DataFrames
function df2json(df::DataFrame)
    len = length(df[:, 1])
    indices = names(df)
    jsonarray = [Dict([string(index) => (isna(df[index][i]) ? nothing : df[index][i])
                       for index in indices])
                 for i in 1:len]
    return JSON.json(jsonarray)
end

function writejson(path::String, df::DataFrame)
    open(path, "w") do f
        write(f, df2json(df))
    end
end
The JSONTables package provides JSON conversion to/from Tables.jl-compatible sources like DataFrame:
using DataFrames
using JSONTables
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
jsonstr = objecttable(df)
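Note that objecttable produces a column-oriented object ({"A":[1,2,3,4],"B":["M","F","F","M"]}); if you want the row-oriented array of objects shown in the question, JSONTables also provides arraytable(df).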