How to convert pandas Series to desired JSON format?

I have the following data, on which I need to apply a groupby followed by an aggregation function.
My data is as follows: data.csv
id,category,sub_category,count
0,x,sub1,10
1,x,sub2,20
2,x,sub2,10
3,y,sub3,30
4,y,sub3,5
5,y,sub4,15
6,z,sub5,20
Here I'm trying to get the count per sub-category. After that I need to store the result in JSON format. The following piece of code (test.py) helps me achieve that:
import pandas as pd
df = pd.read_csv('data.csv')
sub_category_total = df['count'].groupby([df['category'], df['sub_category']]).sum()
print(sub_category_total.reset_index().to_json(orient="records"))
The above code gives me the following format.
[{"category":"x","sub_category":"sub1","count":10},{"category":"x","sub_category":"sub2","count":30},{"category":"y","sub_category":"sub3","count":35},{"category":"y","sub_category":"sub4","count":15},{"category":"z","sub_category":"sub5","count":20}]
But, my desired format is as follows:
{
  "x": [
    {"sub_category": "sub1", "count": 10},
    {"sub_category": "sub2", "count": 30}
  ],
  "y": [
    {"sub_category": "sub3", "count": 35},
    {"sub_category": "sub4", "count": 15}
  ],
  "z": [
    {"sub_category": "sub5", "count": 20}
  ]
}
By following the discussion in "How to convert pandas DataFrame result to user defined json format", I replaced the last 2 lines of test.py with:
g = df.groupby('category')[["sub_category","count"]].apply(lambda x: x.to_dict(orient='records'))
print(g.to_json())
It gives me the following output.
{"x":[{"count":10,"sub_category":"sub1"},{"count":20,"sub_category":"sub2"},{"count":10,"sub_category":"sub2"}],"y":[{"count":30,"sub_category":"sub3"},{"count":5,"sub_category":"sub3"},{"count":15,"sub_category":"sub4"}],"z":[{"count":20,"sub_category":"sub5"}]}
Though the above result is somewhat similar to my desired format, I couldn't perform any aggregation function here, as it throws the error 'numpy.int64' object has no attribute 'to_dict'. Hence, I end up getting all of the rows from the data file.
Can somebody help me in achieving the above JSON format?

I think you can first aggregate with sum; the parameter as_index=False is passed to groupby, so the output is a DataFrame df1, and then you can use the other solution:
df1 = (df.groupby(['category','sub_category'], as_index=False)['count'].sum())
print (df1)
  category sub_category  count
0        x         sub1     10
1        x         sub2     30
2        y         sub3     35
3        y         sub4     15
4        z         sub5     20
g = (df1.groupby('category')[["sub_category", "count"]]
        .apply(lambda x: x.to_dict(orient='records')))
print (g.to_json())
{
"x": [{
"sub_category": "sub1",
"count": 10
}, {
"sub_category": "sub2",
"count": 30
}],
"y": [{
"sub_category": "sub3",
"count": 35
}, {
"sub_category": "sub4",
"count": 15
}],
"z": [{
"sub_category": "sub5",
"count": 20
}]
}
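If you also want the indented layout shown in the question rather than the compact string from to_json, one option is to build the nested structure yourself and let json.dumps handle the pretty-printing. A minimal sketch, assuming the same data.csv as above:

import json
import pandas as pd

df = pd.read_csv('data.csv')

# aggregate counts per (category, sub_category) pair
df1 = df.groupby(['category', 'sub_category'], as_index=False)['count'].sum()

# build {category: [{"sub_category": ..., "count": ...}, ...]} with plain Python types
result = {
    cat: [{'sub_category': s, 'count': int(c)}
          for s, c in zip(grp['sub_category'], grp['count'])]
    for cat, grp in df1.groupby('category')
}

print(json.dumps(result, indent=2))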

Related

Postgres json to view

I have a table like this (with a jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this JSON into a dataframe with Python:
import pandas as pd
import json
data = {
    "id": [1, 2],
    "myyear": [2016, 2017],
    "value": [5, 9]
}
data = json.dumps(data)
df = pd.read_json(data)
print(df)
I get this result:
   id  myyear  value
0   1    2016      5
1   2    2017      9
How can a get this result directly from the json column via sql in a postgres view?
Note: This assumes that your id, my_year, and value arrays are consistent and have the same length.
This answer uses PostgreSQL's jsonb_array_elements_text function to explode the array elements into rows.
select jsonb_array_elements_text(payload -> 'id') as "id",
       jsonb_array_elements_text(payload -> 'bv_year') as "myyear",
       jsonb_array_elements_text(payload -> 'value') as "value"
from main
And this gives the below output,
id  myyear  value
1   2016        5
2   2017        9
That said, this is not the best design for storing the properties in a jsonb object and could lead to data inconsistencies later. If it's in your control, I would suggest storing the data so that each property's mapping is clear. Some suggestions:
You can instead have separate columns for each property.
If you want to store it as jsonb only, then consider [{"id": "", "year": "", "value": ""}]

pandas json normalize key error with a particular json attribute

I have a JSON like this:
mytestdata = {
    "success": True,
    "message": "",
    "data": {
        "totalCount": 95,
        "goal": [
            {
                "user_id": 123455,
                "user_email": "john.smith#test.com",
                "user_first_name": "John",
                "user_last_name": "Smith",
                "people_goals": [
                    {
                        "goal_id": 545555,
                        "goal_name": "test goal name",
                        "goal_owner": "123455",
                        "goal_narrative": "",
                        "goal_type": {
                            "id": 1,
                            "name": "Team"
                        },
                        "goal_create_at": "1595874095",
                        "goal_modified_at": "1595874095",
                        "goal_created_by": "123455",
                        "goal_updated_by": "123455",
                        "goal_start_date": "1593561600",
                        "goal_target_date": "1601424000",
                        "goal_progress": "34",
                        "goal_progress_color": "#ff9933",
                        "goal_status": "1",
                        "goal_permission": "internal,team",
                        "goal_category": [],
                        "goal_owner_full_name": "John Smith",
                        "goal_team_id": "766754",
                        "goal_team_name": "",
                        "goal_workstreams": []
                    }
                ]
            }
        ]
    }
}
I am trying to display all details in "people_goals" along with "user_last_name", "user_first_name", "user_email", and "user_id" using json_normalize.
So far I am able to display "people_goals", "user_first_name", and "user_email" with the code:
df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email']], errors='ignore')
However, I am having an issue when trying to include ['goal', 'user_id'] in the meta=[] list.
The error is:
TypeError Traceback (most recent call last)
<ipython-input-192-b7a124a075a0> in <module>
7 df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
8 meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email'], ['goal','user_id']],
----> 9 errors='ignore')
10
11 # df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'])
The only difference I see for 'user_id' is that it is not a string
Am I missing something here?
Your code works on my platform. I've migrated away from using the record_path and meta parameters for two reasons: a) they are difficult to work out, and b) there are compatibility issues between versions of pandas.
Therefore I now use the approach of calling json_normalize() multiple times to progressively expand the JSON, or of using pd.Series. I have included both as examples.
df = pd.json_normalize(data=mytestdata['data']).explode("goal")
df = pd.concat([df, df["goal"].apply(pd.Series)], axis=1).drop(columns="goal").explode("people_goals")
df = pd.concat([df, df["people_goals"].apply(pd.Series)], axis=1).drop(columns="people_goals")
df = pd.concat([df, df["goal_type"].apply(pd.Series)], axis=1).drop(columns="goal_type")
df.T
df2 = pd.json_normalize(pd.json_normalize(
    pd.json_normalize(data=mytestdata['data']).explode("goal").to_dict(orient="records")
).explode("goal.people_goals").to_dict(orient="records"))
df2.T
print(df.T.to_string())
output
0
totalCount 95
user_id 123455
user_email john.smith#test.com
user_first_name John
user_last_name Smith
goal_id 545555
goal_name test goal name
goal_owner 123455
goal_narrative
goal_create_at 1595874095
goal_modified_at 1595874095
goal_created_by 123455
goal_updated_by 123455
goal_start_date 1593561600
goal_target_date 1601424000
goal_progress 34
goal_progress_color #ff9933
goal_status 1
goal_permission internal,team
goal_category []
goal_owner_full_name John Smith
goal_team_id 766754
goal_team_name
goal_workstreams []
id 1
name Team
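For reference, this is the call from the question with ['goal', 'user_id'] added to meta, which ran without error on my pandas version; as noted above, whether it raises the TypeError depends on the pandas release (a sketch, reusing the mytestdata dict from the question):

df3 = pd.json_normalize(
    data=mytestdata['data'],
    record_path=['goal', 'people_goals'],
    meta=[
        ['goal', 'user_first_name'],
        ['goal', 'user_last_name'],
        ['goal', 'user_email'],
        ['goal', 'user_id'],
    ],
    errors='ignore',
)
print(df3.T)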

How to convert dataframe output to json format and then Normalize the data?

I am running a SQL query and reading the output as a pandas df. Now I need to convert the data into JSON and normalize it. I tried to_json, but this gives only a partial solution.
Dataframe output:
| SalesPerson | ContactID |
| 12345       | Tom       |
| 12345       | Robin     |
| 12345       | Julie     |
Expected JSON:
{"SalesPerson": "12345", "ContactID":"Tom","Robin","Julie"}
Please see the code below, which I tried.
q = "Select COL1, SalesPerson, ContactID from table;"
df = pd.read_sql(q, sqlconn)
df1=df.iloc[:, 1:2]
df2 = df1.to_json(orient='records')
Also, to_json wraps the result in brackets, which I don't need either.
Try this:
df.groupby('SalesPerson').apply(lambda x: pd.Series({
    'ContactID': x['ContactID'].values
})).reset_index().to_json(orient='records')
Output (pretty printed):
[
{
"SalesPerson": 1,
"ContactID": ["Tom", "Robin", "Julie"]
},
{
"SalesPerson": 2,
"ContactID": ["Jack", "Mike", "Mary"]
}
]
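As an alternative (a sketch, not part of the original answer), the same grouping can be written without constructing a pd.Series inside apply, by collecting the contacts into lists directly:

import pandas as pd

# hypothetical sample matching the question's dataframe
df = pd.DataFrame({
    'SalesPerson': ['12345', '12345', '12345'],
    'ContactID': ['Tom', 'Robin', 'Julie'],
})

out = (df.groupby('SalesPerson')['ContactID']
         .apply(list)
         .reset_index()
         .to_json(orient='records'))
print(out)
# [{"SalesPerson":"12345","ContactID":["Tom","Robin","Julie"]}]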

How to take any CSV file and convert it to JSON? (with Python as a script engine) [Novice user trying to learn NiFi]

1) There is a CSV file containing the following information (the first row is the header):
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
[
{
"first": 1,
"second": 4,
"third": 9,
"total": 14
},
{
"first": 7,
"second": 5,
"third": 2,
"total": 14
},
{
"first": 3,
"second": 8,
"third": 7,
"total": 18
}
]
But it does not work and I am not sure how to fix this. Can anyone provide me with an understanding of how to approach this problem?
NiFi flow: (screenshot of the flow omitted)
Although I'm not into Python, just by googling around I think this might do it:
import csv
with open("YOURFILE.csv") as f:
    reader = csv.DictReader(f)
    data = [r for r in reader]

import json
with open('result.json', 'w') as outfile:
    json.dump(data, outfile)
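Note that csv.DictReader returns every value as a string, so the JSON written above would contain "1" rather than 1. A minimal sketch that converts the values to integers and recomputes the total column (assuming the header names from the sample file):

import csv
import json

with open("YOURFILE.csv") as f:
    rows = []
    for r in csv.DictReader(f):
        row = {k: int(v) for k, v in r.items()}  # every CSV value arrives as a string
        row["total"] = row["first"] + row["second"] + row["third"]  # recompute the sum
        rows.append(row)

with open("result.json", "w") as outfile:
    json.dump(rows, outfile, indent=4)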
You can use the QueryRecord processor and add a new property named total with the value:
select first,second,third,first+second+third total from FLOWFILE
Configure the CSVReader controller service with a matching Avro schema that uses int as the datatype for all the fields, and a JsonRecordSetWriter controller service. Include the total field name so that the output from the QueryRecord processor contains all the columns plus the sum of the columns as total.
Connect the total relationship from the QueryRecord processor for further processing.
Refer to these links regarding Query Record and Configure Record Reader/Writer

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models and I figured that I'd do it with something that I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The dataset I'm using comes from Kaggle. You can grab it and run it yourself.
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print(df)
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the lookup value from the JSON.
If both of these were dataframes, then I could use merge.
This is what I've tried. I don't know if this is the best way to read in the JSON file.
import pandas
df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
# json_file is the parsed JSON dict, e.g. json.load(open('champion_info.json'))
player_names = {v['id']: v['name'] for v in json_file['data'].values()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
I created a dataframe from the 'data' in the JSON file (also transposed the resulting dataframe and set the index to what you want to map on, the id), then mapped that to the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)
# one row per champion, indexed by the numeric id
champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
# map each champion id to its name
df['champ_name'] = df.t1_champ1id.map(champs['name'])
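If the file has more than one champion-id column, the same champs lookup can be reused in a loop. A minimal sketch; the column-selection heuristic and the derived column names are assumptions, so adjust them to the actual columns in lol_file.csv:

# hypothetical heuristic: treat every column whose name contains 'champ' and ends with 'id'
# as a champion-id column
champ_id_cols = [c for c in df.columns if 'champ' in c and c.endswith('id')]

for col in champ_id_cols:
    df[col + '_name'] = df[col].map(champs['name'])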