Dataframe to JSON conversion

I have a dataframe that I need to convert to JSON. The data currently looks like this:
                         text     ids
0                   add a car    None
1                        None   695f1
2                        None  a86b5c
3  add another car to my log    None
4                        None    1ba0
5                    Concerts    None
6                        None    a4f7
7                        None     fea
8                        None     410
I need the JSON to look something like this, ignoring the None values:
{
    "text": "add a car",
    "ids": [
        "695f1",
        "a86b5c"
    ]
}
The steps I have so far (figured this much out myself) are:
First, set NaN where the values are None:
df1 = df1.fillna(value=np.nan)
Fill NaN with the previous known value:
df1['text'] = df1['text'].fillna(method='ffill')
Drop the remaining NaN rows:
df1 = df1.dropna()
Convert to JSON:
df1.to_json('temp.json', orient='records', lines=True)
The problem is that the format appears incorrect. I am seeing:
{"text":"add a car","ids":"695f1"}
{"text":"add a car","ids":"a86b5c"}
{"text":"add another car to my log","ids":"1ba0"}
{"text":"Concerts","ids":"a4f7"}
I want:
{
    "text": "add a car",
    "ids": [
        "695f1",
        "a86b5c"
    ]
}
{
    "text": "add another car to my log",
    "ids": [
        "1ba0"
    ]
}
{
    "text": "Concerts",
    "ids": [
        "a4f7",
        "fea",
        "410"
    ]
}

I think you are close; you need to aggregate the ids into lists first:
df['text'] = df['text'].ffill()
df = df.dropna()
df1 = df.groupby('text', sort=False).agg(list).reset_index()
print (df1)
                         text               ids
0                   add a car   [695f1, a86b5c]
1  add another car to my log            [1ba0]
2                    Concerts  [a4f7, fea, 410]
df1.to_json('temp.json', orient='records')
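For reference, here is a minimal end-to-end sketch of this approach (the sample data is taken from the question; note that orient='records' writes a single JSON array, while adding lines=True writes one object per line, which is closer to the desired output):
import pandas as pd

# Sample data from the question (None marks the missing cells)
df = pd.DataFrame({
    'text': ['add a car', None, None, 'add another car to my log',
             None, 'Concerts', None, None, None],
    'ids': [None, '695f1', 'a86b5c', None, '1ba0', None, 'a4f7', 'fea', '410'],
})

df['text'] = df['text'].ffill()  # carry each text down over its ids rows
df = df.dropna()                 # drop the rows that held only a text value
df1 = df.groupby('text', sort=False).agg(list).reset_index()

df1.to_json('temp.json', orient='records')                # single JSON array
# df1.to_json('temp.json', orient='records', lines=True)  # one object per line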

Related

Table from nested list, struct

I have this JSON data:
consumption_json = """
{
"count": 48,
"next": null,
"previous": null,
"results": [
{
"consumption": 0.063,
"interval_start": "2018-05-19T00:30:00+0100",
"interval_end": "2018-05-19T01:00:00+0100"
},
{
"consumption": 0.071,
"interval_start": "2018-05-19T00:00:00+0100",
"interval_end": "2018-05-19T00:30:00+0100"
},
{
"consumption": 0.073,
"interval_start": "2018-05-18T23:30:00+0100",
"interval_end": "2018-05-18T00:00:00+0100"
}
]
}
"""
and I would like to convert the results list to an Arrow table.
I have managed this by first converting it to a Python data structure using Python's json library, and then converting that to an Arrow table:
import json
import pyarrow as pa

consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather use pyarrow exclusively for this.
I can use pyarrow's JSON reader to make a table:
import pyarrow.json  # makes pa.json available

reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
Following this answer, https://stackoverflow.com/a/72880717/3617057, I can get closer:
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-17 23:00:00
]
]
flat is a ChunkedArray whose underlying arrays are StructArrays. To convert it to a table, you need to convert each chunk to a RecordBatch and concatenate them into a table:
pa.Table.from_batches(
    [
        pa.RecordBatch.from_struct_array(s)
        for s in flat.iterchunks()
    ]
)
If flat is just a StructArray (not a ChunkedArray), you can call:
pa.Table.from_batches(
    [
        pa.RecordBatch.from_struct_array(flat)
    ]
)
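Putting the pieces together, a minimal sketch of the pure-pyarrow path (assuming the consumption_json string from the question is in scope):
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.json  # needed for pa.json.read_json

# Parse the whole JSON document into a one-row table
reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)

# Flatten the list<struct> column, then turn each struct chunk into a batch
flat = pc.list_flatten(table_from_reader['results'])
table = pa.Table.from_batches(
    [pa.RecordBatch.from_struct_array(s) for s in flat.iterchunks()]
)
print(table)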

pandas json normalize key error with a particular json attribute

I have a JSON as:
mytestdata = {
    "success": True,
    "message": "",
    "data": {
        "totalCount": 95,
        "goal": [
            {
                "user_id": 123455,
                "user_email": "john.smith#test.com",
                "user_first_name": "John",
                "user_last_name": "Smith",
                "people_goals": [
                    {
                        "goal_id": 545555,
                        "goal_name": "test goal name",
                        "goal_owner": "123455",
                        "goal_narrative": "",
                        "goal_type": {
                            "id": 1,
                            "name": "Team"
                        },
                        "goal_create_at": "1595874095",
                        "goal_modified_at": "1595874095",
                        "goal_created_by": "123455",
                        "goal_updated_by": "123455",
                        "goal_start_date": "1593561600",
                        "goal_target_date": "1601424000",
                        "goal_progress": "34",
                        "goal_progress_color": "#ff9933",
                        "goal_status": "1",
                        "goal_permission": "internal,team",
                        "goal_category": [],
                        "goal_owner_full_name": "John Smith",
                        "goal_team_id": "766754",
                        "goal_team_name": "",
                        "goal_workstreams": []
                    }
                ]
            }
        ]
    }
}
I am trying to display all details in "people_goals" along with "user_last_name", "user_first_name", "user_email", and "user_id", using json_normalize.
So far I am able to display "people_goals", "user_first_name", and "user_email" with the code:
df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
                        meta=[['goal', 'user_first_name'], ['goal', 'user_last_name'], ['goal', 'user_email']], errors='ignore')
However, I am having an issue when trying to include ['goal', 'user_id'] in meta=[].
The error is:
TypeError Traceback (most recent call last)
<ipython-input-192-b7a124a075a0> in <module>
7 df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
8 meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email'], ['goal','user_id']],
----> 9 errors='ignore')
10
11 # df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'])
The only difference I see for 'user_id' is that it is not a string
Am I missing something here?
Your code works on my platform. I've migrated away from using the record_path and meta parameters for two reasons: (a) they are difficult to work out, and (b) there are compatibility issues between versions of pandas.
Therefore I now use the approach of calling json_normalize() multiple times to progressively expand the JSON, or of using pd.Series. I have included both as examples.
import pandas as pd

df = pd.json_normalize(data=mytestdata['data']).explode("goal")
df = pd.concat([df, df["goal"].apply(pd.Series)], axis=1).drop(columns="goal").explode("people_goals")
df = pd.concat([df, df["people_goals"].apply(pd.Series)], axis=1).drop(columns="people_goals")
df = pd.concat([df, df["goal_type"].apply(pd.Series)], axis=1).drop(columns="goal_type")
df.T

df2 = pd.json_normalize(pd.json_normalize(
    pd.json_normalize(data=mytestdata['data']).explode("goal").to_dict(orient="records")
).explode("goal.people_goals").to_dict(orient="records"))
df2.T
print(df.T.to_string())
output
0
totalCount 95
user_id 123455
user_email john.smith#test.com
user_first_name John
user_last_name Smith
goal_id 545555
goal_name test goal name
goal_owner 123455
goal_narrative
goal_create_at 1595874095
goal_modified_at 1595874095
goal_created_by 123455
goal_updated_by 123455
goal_start_date 1593561600
goal_target_date 1601424000
goal_progress 34
goal_progress_color #ff9933
goal_status 1
goal_permission internal,team
goal_category []
goal_owner_full_name John Smith
goal_team_id 766754
goal_team_name
goal_workstreams []
id 1
name Team
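For completeness, here is the record_path/meta call from the question as a self-contained sketch; as noted above, it runs on my platform, including the ['goal', 'user_id'] meta entry (mytestdata is the dict from the question):
import pandas as pd

df2 = pd.json_normalize(
    data=mytestdata['data'],
    record_path=['goal', 'people_goals'],
    meta=[['goal', 'user_first_name'], ['goal', 'user_last_name'],
          ['goal', 'user_email'], ['goal', 'user_id']],
    errors='ignore',
)
print(df2.T)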

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models, and I figured that I'd do it with something I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The datasets I'm using come from Kaggle. You can grab them and run this yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print(df)
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the lookup value from the JSON.
If both of these were dataframes, I could use merge.
This is what I've tried. I don't know if this is the best way to read in the JSON file:
import pandas
df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
the label [gameDuration] is not in the [index]
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
# json_file is the parsed JSON dict, e.g. json_file = json.load(open('champion_info.json'))
player_names = {v['id']: v['name'] for v in json_file['data'].values()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
Created a dataframe from the 'data' in the JSON file (also transposed the resulting dataframe and then set the index to what you want to map, the id), then mapped that onto the original df:
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])
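As a quick self-contained check, here is a sketch using the shortened champion JSON and the five sample rows from the question:
import pandas as pd

champ_json = {
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {"title": "the Dark Child", "id": 1, "key": "Annie", "name": "Annie"},
        "2": {"title": "the Berserker", "id": 2, "key": "Olaf", "name": "Olaf"},
    },
}
df = pd.DataFrame({'gameDuration': [1949, 1851, 1493, 1758, 2094],
                   't1_champ1id': [1, 2, 1, 1, 2]})

champs = pd.DataFrame(champ_json['data']).T  # one row per champion
champs.set_index('id', inplace=True)         # index by the integer id
df['champ_name'] = df.t1_champ1id.map(champs['name'])
print(df)  # champ_name is Annie/Olaf per row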

How to convert pandas Series to desired JSON format?

I have the following data, on which I need to apply an aggregation function after a groupby.
My data is as follows: data.csv
id,category,sub_category,count
0,x,sub1,10
1,x,sub2,20
2,x,sub2,10
3,y,sub3,30
4,y,sub3,5
5,y,sub4,15
6,z,sub5,20
Here I'm trying to get the counts sub-category-wise. After that, I need to store the result in JSON format. The following piece of code achieves that (test.py):
import pandas as pd

df = pd.read_csv('data.csv')
sub_category_total = df['count'].groupby([df['category'], df['sub_category']]).sum()
print(sub_category_total.reset_index().to_json(orient="records"))
The above code gives me the following format.
[{"category":"x","sub_category":"sub1","count":10},{"category":"x","sub_category":"sub2","count":30},{"category":"y","sub_category":"sub3","count":35},{"category":"y","sub_category":"sub4","count":15},{"category":"z","sub_category":"sub5","count":20}]
But, my desired format is as follows:
{
    "x": [
        {"sub_category": "sub1", "count": 10},
        {"sub_category": "sub2", "count": 30}
    ],
    "y": [
        {"sub_category": "sub3", "count": 35},
        {"sub_category": "sub4", "count": 15}
    ],
    "z": [
        {"sub_category": "sub5", "count": 20}
    ]
}
By following the discussion at "How to convert pandas DataFrame result to user defined json format", I replaced the last two lines of test.py with:
g = df.groupby('category')[["sub_category", "count"]].apply(lambda x: x.to_dict(orient='records'))
print(g.to_json())
It gives me the following output.
{"x":[{"count":10,"sub_category":"sub1"},{"count":20,"sub_category":"sub2"},{"count":10,"sub_category":"sub2"}],"y":[{"count":30,"sub_category":"sub3"},{"count":5,"sub_category":"sub3"},{"count":15,"sub_category":"sub4"}],"z":[{"count":20,"sub_category":"sub5"}]}
Though the above result is somewhat similar to my desired format, I couldn't perform any aggregation here, as it throws an error saying 'numpy.int64' object has no attribute 'to_dict'. Hence, I end up getting all of the rows in the data file.
Can somebody help me in achieving the above JSON format?
I think you can first aggregate with sum, passing as_index=False to groupby so the output is a DataFrame df1, and then use the other solution:
df1 = (df.groupby(['category','sub_category'], as_index=False)['count'].sum())
print (df1)
category sub_category count
0 x sub1 10
1 x sub2 30
2 y sub3 35
3 y sub4 15
4 z sub5 20
g = (df1.groupby('category')[["sub_category", "count"]]
        .apply(lambda x: x.to_dict(orient='records')))
print(g.to_json())
{
"x": [{
"sub_category": "sub1",
"count": 10
}, {
"sub_category": "sub2",
"count": 30
}],
"y": [{
"sub_category": "sub3",
"count": 35
}, {
"sub_category": "sub4",
"count": 15
}],
"z": [{
"sub_category": "sub5",
"count": 20
}]
}
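Note that to_json() itself emits compact JSON; the indented output above is reformatted for readability. If you actually need the pretty-printed form, one option (my suggestion, not something pandas does for you) is to round-trip through the standard json module:
import json

# re-serialize the compact output from pandas with indentation
pretty = json.dumps(json.loads(g.to_json()), indent=4)
with open('result.json', 'w') as f:
    f.write(pretty)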

Dataframe in R to be converted to sequence of JSON objects

I asked the same question after editing a previous question of mine twice. I am sorry for the bad usage of this website; I have flagged that question for deletion and am posting a proper new question here. Please look into this.
I am basically working on recommender system code. The output has to be converted to a sequence of JSON objects. I have a matrix that serves as a lookup table for every item ID, with the list of the closest items it is related to and the similarity scores associated with their combinations.
Let me explain through an example.
Suppose I have a matrix
In the below example, Item 1 is similar to Items 22 and 23 with similarity scores 0.8 and 0.5 respectively. And the remaining rows follow the same structure.
X1  X2  X3   X4   X5
 1  22  23  0.8  0.5
34   4  87  0.4  0.4
23   7  92  0.6  0.5
I want a JSON structure for every item (the X1 of every row), together with its recommended items and the similarity score for each combination, as a separate JSON entity, written out one after another. I don't want one enclosing JSON object containing these individual ones.
Assume there is one more entity called "coid" that will be given as input to the code. I assume it is XYZ and is the same for all rows.
{ "_id" : { "coid" : "XYZ", "iid" : "1"}, "items" : [ { "item" : "22", "score" : 0.8},{ "item": "23", "score" : 0.5}] }
{ "_id" : { "coid" : "XYZ", "iid" : "34"},"items" : [ { "item" : "4", "score" : 0.4},{ "item": "87", "score" : 0.4}] }
{ "_id" : { "coid" : "XYZ", "iid" : "23"},"items" : [ { "item" : "7", "score" : 0.6},{ "item": "92", "score" : 0.5}] }
As in the above, each entity is a valid JSON structure/object but they are not put together into a separate JSON object as a whole.
I appreciate all the help on the previous question, but I feel this alteration is not covered by those answers, because in the end, if you call toJSON(some entity), it converts the entire thing into one JSON object. I don't want that.
I want individual ones like these to be written to a file.
I am very sorry for my ignorance and inconvenience. Please help.
Thanks.
library(rjson)

## Your matrix
mat <- matrix(c(1, 34, 23,
                22, 4, 7,
                23, 87, 92,
                0.8, 0.4, 0.6,
                0.5, 0.4, 0.5), byrow=FALSE, nrow=3)
I use a function (with the not very interesting name makejson) that takes a row of the matrix and returns a JSON object. It builds two list objects, _id and items, and combines them into a JSON object:
makejson <- function(x, coid="ABC") {
  `_id` <- list(coid = coid, iid = x[1])
  nitem <- (length(x) - 1) / 2  # Number of items
  items <- list()
  for (i in seq(1, nitem)) {
    items[[i]] <- list(item = x[i + 1], score = x[i + 1 + nitem])
  }
  toJSON(list(`_id` = `_id`, items = items))
}
Then, using apply (or a for loop), I apply the function to each row of the matrix:
res <- apply(mat, 1, makejson, coid="XYZ")
cat(res, sep = "\n")
## {"_id":{"coid":"XYZ","iid":1},"items":[{"item":22,"score":0.8},{"item":23,"score":0.5}]}
## {"_id":{"coid":"XYZ","iid":34},"items":[{"item":4,"score":0.4},{"item":87,"score":0.4}]}
## {"_id":{"coid":"XYZ","iid":23},"items":[{"item":7,"score":0.6},{"item":92,"score":0.5}]}
The result can be saved to a file with cat by specifying the file argument.
## cat(res, sep="\n", file="out.json")
There is a small difference between your output and mine: in yours, the numbers are in quotes ("). If you want it like that, mat has to be a character matrix:
## mat <- matrix(as.character(c(1,34,23, ...
Hope it helps,
alex