How to change a value in a dataframe based on a lookup from a JSON file

I want to practice building models and I figured that I'd do it with something that I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value in a json.
The dataset I'm using comes from Kaggle. You can grab it and run this yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print(df)
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframes, then I could use merge.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas
df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, 't1_champ1id'] == i:
            df.loc[j, 't1_champ1id'] = champ[0][i]['name']
I get the below error:
the label [gameDuration] is not in the [index]
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!

for j in df: iterates over the column names in df, not the rows, which is why .loc complains that the label gameDuration is not in the index. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary and then map it onto df['t1_champ1id'].
# assumes the JSON file has been loaded into the dict `json_file`
player_names = {v['id']: v['name'] for v in json_file['data'].values()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
#    gameDuration t1_champ1id
# 0          1949       Annie
# 1          1851        Olaf
# 2          1493       Annie
# 3          1758       Annie
# 4          2094        Olaf

Created a dataframe from the 'data' in the JSON file (also transposed the resulting dataframe and then set the index to what you want to map on, the id), then mapped that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])
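Both answers rely on the same .map behavior: given a Series, .map uses the Series' index as the lookup key, so after set_index('id') the champs['name'] column acts like a plain id-to-name dict. A quick sanity check, using values from the shortened JSON above:
print(champs['name'].to_dict())
# {1: 'Annie', 2: 'Olaf'}
# any id missing from the lookup maps to NaN instead of raising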

Postgres json to view

I have a table like this (with a jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this json with python into a dataframe:
import pandas as pd
import json

data = {
    "id": [1, 2],
    "myyear": [2016, 2017],
    "value": [5, 9]
}
data = json.dumps(data)
df = pd.read_json(data)
print(df)
I get this result:
   id  myyear  value
0   1    2016      5
1   2    2017      9
How can I get this result directly from the json column via SQL in a Postgres view?
Note: this assumes that your id, myyear, and value arrays are consistent and have the same length.
This answer uses PostgreSQL's jsonb_array_elements_text function to explode the array elements into rows. (Since Postgres 10, multiple set-returning functions in the select list advance in lockstep, which is why the three exploded arrays line up row by row instead of cross-joining.)
select jsonb_array_elements_text(payload -> 'id') as "id",
       jsonb_array_elements_text(payload -> 'myyear') as "myyear",
       jsonb_array_elements_text(payload -> 'value') as "value"
from main
And this gives the output below:
id myyear value
1 2016 5
2 2017 9
That said, storing parallel arrays in a single jsonb object is not the best design and could lead to data inconsistencies later. If it's in your control, I would suggest storing the data so that each property's mapping is clear. Some suggestions:
You can instead have separate columns for each property.
If you want to keep it as jsonb, consider storing one object per row: [{"id": "", "year": "", "value": ""}]
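A quick illustration of why that row-object shape is friendlier: pandas reads it directly with orient="records", no exploding required. A minimal sketch with the sample values inlined (not pulled from the database):
import io
import pandas as pd

payload = '[{"id": 1, "myyear": 2016, "value": 5}, {"id": 2, "myyear": 2017, "value": 9}]'
df = pd.read_json(io.StringIO(payload), orient="records")
print(df)
#    id  myyear  value
# 0   1    2016      5
# 1   2    2017      9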

Using Pandas to Flatten a JSON with a nested array

I have the following JSON. I want to pull out task, flatten it, and put it into its own dataframe, including the id from the parent.
[
    {
        "id": 123456,
        "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "task": [
            {
                "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "taskId": 898989,
                "status": "Closed"
            },
            {
                "assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
                "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
                "taskId": 999999
            }
        ],
        "state": "Complete"
    },
    {
        "id": 123477,
        "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
        "resolvedBy": {"id": null, "firstName": null, "lastName": null},
        "task": [],
        "state": "Inprogress"
    }
]
I would like to get a dataframe from tasks like so
id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status
I have flattened the entire dataframe using
df=pd.json_normalize(json.loads(df.to_json(orient='records')))
It left task as [{}], which I think is okay because I want to pull the tasks out into their own dataframe and include the id from the parent.
I have id and tasks in a dataframe like so
tasksdf=storiesdf[['tasks','id']]
then I want to normalize it like
tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))
but I know that since it is in an array, I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.
The main problem is that your task record is empty in some cases, so it won't appear in your dataframe if you create it with json_normalize.
Secondly, some columns are redundant between assignee, resolvedBy, and the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:
import json
import pandas as pd

# json_str holds the JSON shown above
json_data = json.loads(json_str)
df = pd.DataFrame.from_dict(json_data)

# one row per task; stories with an empty task list keep a single row with NaN
df = df.explode('task')

# flatten the top-level assignee and resolvedBy dicts into their own columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)
df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

# normalize the nested tasks, carrying the parent id and state along
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)
print(df.drop_duplicates().reset_index(drop=True))
Output:
id state assignee.id assignee.firstName ... resolvedBy.firstName resolvedBy.lastName taskId status
0 123456.0 Complete 5757 Jim ... Jim Johnson 898989.0 Closed
1 123477.0 Inprogress 8576 Jack ... None None NaN NaN
2 123456 Complete 5857 Nacy ... George Johnson 999999.0 NaN
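If all you actually need is the tasks in their own dataframe with the parent id attached (the stated goal), the json_normalize call in the middle of the snippet above already produces exactly that. A minimal sketch, reusing the json_str assumption from above:
import json
import pandas as pd

json_data = json.loads(json_str)  # json_str holds the JSON shown above

# one row per task; meta copies the parent story's id and state onto each row
tasks_df = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
print(tasks_df)
Stories with an empty task list simply contribute no rows here, which is often what you want for a standalone tasks table.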

Dask: how to open JSON with a list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a dataframe like the following:
ddf.compute()
   id    owner    pet_id
0   1  "Charlie"  "pet_1"
1   2  "Charlie"  "pet_2"
3   4  "Buddy"    "pet_3"
but I'm getting an error with the following attempt:
import dask.dataframe as dd
import pandas as pd

_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PS:
I also tried doing it in the following way:
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask misinterprets the data:
ddf.compute()
       id                   owner              pet_id
0  [1, 2]  ["Charlie", "Charlie"]  ["pet_1", "pet_2"]
1     [4]               ["Buddy"]           ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
             blocksize=None, orient="records",
             lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
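If you control the file format, a line-delimited layout sidesteps the problem entirely and also lets Dask split large files into partitions. A minimal sketch, assuming hypothetical .ndjson files with one JSON object per line:
import dask.dataframe as dd

# each file would hold one JSON object per line, e.g.:
# {"id": 1, "owner": "Charlie", "pet_id": "pet_1"}
# {"id": 2, "owner": "Charlie", "pet_id": "pet_2"}
ddf = dd.read_json("mypets/*.ndjson", orient="records", lines=True,
                   blocksize=2**26)  # ~64 MB partitions; only valid for line-delimited files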

Read JSON to pandas dataframe - Getting ValueError: Mixing dicts with non-Series may lead to ambiguous ordering

I am trying to read the JSON structure below into a pandas dataframe, but it throws out the error message:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
Json data:
'''
{
    "Name": "Bob",
    "Mobile": 12345678,
    "Boolean": true,
    "Pets": ["Dog", "cat"],
    "Address": {
        "Permanent Address": "USA",
        "Current Address": "UK"
    },
    "Favorite Books": {
        "Non-fiction": "Outliers",
        "Fiction": {"Classic Literature": "The Old Man and the Sea"}
    }
}
'''
How do I get this right? I have tried the script below...
'''
j_df = pd.read_json('json_file.json')
j_df

with open(j_file) as jsonfile:
    data = json.load(jsonfile)
'''
Read the JSON from the file first, then pass it to json_normalize and explode the Pets list with DataFrame.explode:
import json
import pandas as pd

with open('json_file.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data).explode('Pets').reset_index(drop=True)
print(df)
Name Mobile Boolean Pets Address.Permanent Address \
0 Bob 12345678 True Dog USA
1 Bob 12345678 True cat USA
Address.Current Address Favorite Books.Non-fiction \
0 UK Outliers
1 UK Outliers
Favorite Books.Fiction.Classic Literature
0 The Old Man and the Sea
1 The Old Man and the Sea
EDIT: To write the values into a sentence, you can select the necessary columns, remove duplicates, convert to a numpy array, and loop:
for x, y in df[['Name', 'Favorite Books.Fiction.Classic Literature']].drop_duplicates().to_numpy():
    print(f"{x}’s favorite classical literature book is {y}.")
Bob’s favorite classical literature book is The Old Man and the Sea.

How to convert pandas Series to desired JSON format?

I have the following data, on which I need to apply an aggregation function after a groupby.
My data is as follows: data.csv
id,category,sub_category,count
0,x,sub1,10
1,x,sub2,20
2,x,sub2,10
3,y,sub3,30
4,y,sub3,5
5,y,sub4,15
6,z,sub5,20
Here I'm trying to get the count per sub-category. After that I need to store the result in JSON format. The following piece of code helps me achieve that: test.py
import pandas as pd

df = pd.read_csv('data.csv')
sub_category_total = df['count'].groupby([df['category'], df['sub_category']]).sum()
print(sub_category_total.reset_index().to_json(orient="records"))
The above code gives me the following format.
[{"category":"x","sub_category":"sub1","count":10},{"category":"x","sub_category":"sub2","count":30},{"category":"y","sub_category":"sub3","count":35},{"category":"y","sub_category":"sub4","count":15},{"category":"z","sub_category":"sub5","count":20}]
But, my desired format is as follows:
{
    "x": [{
        "sub_category": "sub1",
        "count": 10
    },
    {
        "sub_category": "sub2",
        "count": 30
    }],
    "y": [{
        "sub_category": "sub3",
        "count": 35
    },
    {
        "sub_category": "sub4",
        "count": 15
    }],
    "z": [{
        "sub_category": "sub5",
        "count": 20
    }]
}
By following the discussions # How to convert pandas DataFrame result to user defined json format, I replaced the last 2 lines of test.py with,
g = df.groupby('category')[["sub_category","count"]].apply(lambda x: x.to_dict(orient='records'))
print(g.to_json())
It gives me the following output.
{"x":[{"count":10,"sub_category":"sub1"},{"count":20,"sub_category":"sub2"},{"count":10,"sub_category":"sub2"}],"y":[{"count":30,"sub_category":"sub3"},{"count":5,"sub_category":"sub3"},{"count":15,"sub_category":"sub4"}],"z":[{"count":20,"sub_category":"sub5"}]}
Though the above result is close to my desired format, I couldn't apply any aggregation function here, as it throws an error saying 'numpy.int64' object has no attribute 'to_dict'. Hence, I end up getting all of the rows in the data file.
Can somebody help me in achieving the above JSON format?
I think you can first aggregate with sum, passing as_index=False to groupby so that the output is a DataFrame df1, and then use the other solution:
df1 = (df.groupby(['category','sub_category'], as_index=False)['count'].sum())
print (df1)
  category sub_category  count
0        x         sub1     10
1        x         sub2     30
2        y         sub3     35
3        y         sub4     15
4        z         sub5     20
g = (df1.groupby('category')[["sub_category", "count"]]
        .apply(lambda x: x.to_dict(orient='records')))
print(g.to_json())
{
    "x": [{
        "sub_category": "sub1",
        "count": 10
    }, {
        "sub_category": "sub2",
        "count": 30
    }],
    "y": [{
        "sub_category": "sub3",
        "count": 35
    }, {
        "sub_category": "sub4",
        "count": 15
    }],
    "z": [{
        "sub_category": "sub5",
        "count": 20
    }]
}