I am trying to flatten nested dictionaries by using json_normalize.
My data is like this:
data = [
{'gra': [
{
'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'
}
]}
]
I want to get a dataframe like this:
A B C.D C.E date
1 9 1 1 2019-06-27
2 1 1 2 2019-06-27
6 1 1 3 2019-06-27
I tried record_path and meta in json_normalize, but it keeps giving me an error.
How do you achieve this?
json_normalize does a pretty good job of flattening the object into a
pandas DataFrame:
import pandas as pd  # pandas.io.json.json_normalize is deprecated; use pd.json_normalize
pd.json_normalize(sample_object)
import pandas as pd

data_ = [item['gra'][0] for item in data]  # unwrap each single-element 'gra' list
print(pd.json_normalize(data_))
output:
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27
Iterating over the list like this is the easiest way, though not necessarily the best.
I hope it solves your problem.
data = [{'gra':[{'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'}]},
{'gra':[{'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'}]},
{'gra':[{'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'}]}
]
import pandas as pd

final_list = []
for i in data:
    temp = dict()
    temp['A'] = i['gra'][0]['A']
    temp['B'] = i['gra'][0]['B']
    temp['C.D'] = i['gra'][0]['C']['D']
    temp['C.E'] = i['gra'][0]['C']['E']
    temp['date'] = i['gra'][0]['date']
    final_list.append(temp)
df = pd.DataFrame.from_dict(final_list)
print(df)
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27
First we normalize with record_path='gra', then reorder the columns to produce the required output:
import pandas as pd
data = [
{'gra': [
{
'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'
}
]}
]
df = pd.json_normalize(data, 'gra')
cols = ['A','B','C.D','C.E','date']
df = df[cols]
print(df)
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27
Objective: I have fetched insights data from my Instagram account using the Instagram Graph API. Below is a JSON object (audience_insights['data']):
[{'name': 'audience_city',
'period': 'lifetime',
'values': [{'value': {'London, England': 1,
'Kharkiv, Kharkiv Oblast': 1,
'Jamui, Bihar': 1,
'Burdwan, West Bengal': 1,
'Kolkata, West Bengal': 112,
'Dhulian, West Bengal': 1,
'Argonne, Wisconsin': 1,
'College Park, Georgia': 1,
'Pakaur, Jharkhand': 1,
'Bristol, England': 1,
'Delhi, Delhi': 1,
'Gaya, Bihar': 1,
'Howrah, West Bengal': 1,
'Kanpur, Uttar Pradesh': 1,
'Jaipur, Rajasthan': 2,
'Panipat, Haryana': 1,
'Saint Etienne, Rhône-Alpes': 1,
'Panagarh, West Bengal': 1,
'Bhagalpur, Bihar': 1,
'Frankfurt, Hessen': 1,
'Riyadh, Riyadh Region': 1,
'Roorkee, Uttarakhand': 1,
'Harinavi, West Bengal': 1,
'Secunderabad, Telangana': 1,
'Mumbai, Maharashtra': 3,
'Patna, Bihar': 11,
'Obando, Valle del Cauca': 1,
'Jaunpur, Uttar Pradesh': 1,
'Sitamau, Madhya Pradesh': 1},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Audience City',
'description': "The cities of this profile's followers",
'id': '17841406112341342/insights/audience_city/lifetime'},
{'name': 'audience_country',
'period': 'lifetime',
'values': [{'value': {'DE': 1,
'IN': 144,
'GB': 2,
'UA': 1,
'FR': 1,
'CO': 1,
'SA': 1,
'US': 2},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Audience Country',
'description': "The countries of this profile's followers",
'id': '17841406112341342/insights/audience_country/lifetime'},
{'name': 'audience_gender_age',
'period': 'lifetime',
'values': [{'value': {'F.13-17': 1,
'F.18-24': 20,
'F.25-34': 5,
'M.13-17': 4,
'M.18-24': 79,
'M.25-34': 15,
'M.35-44': 1,
'M.45-54': 3,
'U.13-17': 4,
'U.18-24': 16,
'U.25-34': 2,
'U.45-54': 3},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Gender and Age',
'description': "The gender and age distribution of this profile's followers",
'id': '17841406112341342/insights/audience_gender_age/lifetime'}]
I wish to loop through this and create three data frames:
The first shows location and count.
| | Location | Count |
| ---- | -------------- | ----- |
| 0 | London, England | 1 |
Second would be a similar data frame with country and count.
And finally, the last would show the gender and age category against the count.
So far, I've been able to extract the three dictionaries that I eventually need to convert to separate data frames. I've stored the dictionaries in a list all_data.
all_data = []
for item in audience_insights['data']:
    data = item['values'][0]['value']
    all_data.append(data)
df_location = pd.DataFrame(all_data)
all_data
[{'London, England': 1,
'Kharkiv, Kharkiv Oblast': 1,
'Jamui, Bihar': 1,
'Burdwan, West Bengal': 1,
'Kolkata, West Bengal': 112,
'Dhulian, West Bengal': 1,
'Argonne, Wisconsin': 1,
'College Park, Georgia': 1,
'Bristol, England': 1,
'Bikaner, Rajasthan': 1,
'Delhi, Delhi': 1,
'Gaya, Bihar': 1,
'Howrah, West Bengal': 1,
'Jaipur, Rajasthan': 1,
'Kanpur, Uttar Pradesh': 1,
'Panipat, Haryana': 1,
'Saint Etienne, Rhône-Alpes': 1,
'Panagarh, West Bengal': 1,
'Panchagan, Odisha': 1,
'Bhagalpur, Bihar': 1,
'Frankfurt, Hessen': 1,
'Riyadh, Riyadh Region': 1,
'Roorkee, Uttarakhand': 1,
'Harinavi, West Bengal': 1,
'Mumbai, Maharashtra': 3,
'Patna, Bihar': 11,
'Obando, Valle del Cauca': 1,
'Jaunpur, Uttar Pradesh': 1,
'Hyderabad, Telangana': 1,
'Sitamau, Madhya Pradesh': 1},
{'DE': 1, 'IN': 144, 'GB': 2, 'FR': 1, 'CO': 1, 'UA': 1, 'SA': 1, 'US': 2},
{'F.13-17': 1,
'F.18-24': 20,
'F.25-34': 5,
'M.13-17': 4,
'M.18-24': 79,
'M.25-34': 16,
'M.35-44': 1,
'M.45-54': 3,
'U.13-17': 4,
'U.18-24': 15,
'U.25-34': 2,
'U.45-54': 3}]
I want to be able to convert each of these dictionaries into a data frame such that the keys are in the first column and the values are in the second.
Thank you for your help!
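A minimal sketch of the key/value-to-columns conversion, assuming pandas; city_counts below is a shortened, hypothetical stand-in for the first dictionary in all_data:

```python
import pandas as pd

# Shortened stand-in for the first dictionary in all_data
city_counts = {'London, England': 1, 'Kolkata, West Bengal': 112, 'Patna, Bihar': 11}

# dict.items() yields (key, value) pairs, which become the two columns
df_city = pd.DataFrame(list(city_counts.items()), columns=['Location', 'Count'])
print(df_city)
```

The same pattern works for the country and gender/age dictionaries; only the column names change.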
I have the following dataframe:
pd.DataFrame({'id':[1,1,1,2,2], 'key': ['a', 'a', 'b', 'a', 'b'], 'value': ['kkk', 'aaa', '5', 'kkk','8']})
I want to convert it to the following data frame:
id value
1 {'a':['kkk', 'aaa'], 'b': 5}
2 {'a':['kkk'], 'b': 8}
I am trying to do this using the .to_dict method, but the output is:
df.groupby(['id','key']).aggregate(list).groupby('id').aggregate(list)
{'value': {1: [['kkk', 'aaa'], ['5']], 2: [['kkk'], ['8']]}}
Should I perform dict comprehension or there is an efficient logic to build such generic json/dict?
After you groupby(['id', 'key']) and agg(list), you can group by the first level of the index and, for each group, use droplevel + to_dict:
new_df = (df.groupby(['id', 'key'])
            .agg(list)
            .groupby(level=0)
            .apply(lambda x: x['value'].droplevel(0).to_dict())
            .reset_index(name='value'))
Output:
>>> new_df
id value
0 1 {'a': ['kkk', 'aaa'], 'b': ['5']}
1 2 {'a': ['kkk'], 'b': ['8']}
Or, simpler:
new_df = df.groupby('id').apply(lambda x: x.groupby('key')['value'].agg(list).to_dict()).reset_index(name='value')
I am converting a nested JSON file with more than 100 records into a flattened CSV file. The sample JSON file is shown below:
sampleJson = {
'record1':
{
'text':[ ['A', 'fried', 'is', 'a', 'nice', 'companion', '.'],
['The', 'birds', 'are', 'flying', '.']],
'values':[ [0, 1, 0, 0],
[1, 1, 0, 1]],
'pairs':[ [0, 2],
[2, 1]]
},
'record2':
{
'text':[ ['We', 'can', 'work', 'hard', 'together', '.'],
['Let', 'the', 'things', 'happen', '.'],
['There', 'is', 'always', 'a', 'way', 'out', '.']],
'values':[ [0, 1, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]],
'pairs':[ [0, 2],
[3, 4],
[2, 1]]
},
..... 100 records
}
The CSV structure I want from this nested JSON is:
record1, A fried is a nice companion., 0, 1, 0, 0, [0, 2]
       , The birds are flying., 1, 1, 0, 1, [2, 1]
record2, We can work hard together., 0, 1, 0, 0, [0, 2]
       , Let the things happen., 0, 1, 1, 1, [3, 4]
       , There is always a way out., 1, 1, 0, 1, [2, 1]
record3,
....... upto 100 records
I used the following code to flatten the nested file:
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            out[name[:-1]] = x
    flatten(y)
    return out
flatIt = flatten_json(sampleJson)
df= pd.json_normalize(flatIt)
df.to_csv('outPutFile.csv', encoding='utf-8')
print(df)
I am getting a long list of columns with names like record1_text…, record1_values…, record1_pairs…, record2_text… and so on, with a single row, and each word of each sentence ends up in its own column.
I would appreciate some help.
Thanks.
You can use this example to parse the JSON into a dataframe:
import pandas as pd
sampleJson = {
'record1':
{
'text':[ ['A', 'fried', 'is', 'a', 'nice', 'companion', '.'],
['The', 'birds', 'are', 'flying', '.']],
'values':[ [0, 1, 0, 0],
[1, 1, 0, 1]],
'pairs':[ [0, 2],
[2, 1]]
},
'record2':
{
'text':[ ['We', 'can', 'work', 'hard', 'together', '.'],
['Let', 'the', 'things', 'happen', '.'],
['There', 'is', 'always', 'a', 'way', 'out', '.']],
'values':[ [0, 1, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]],
'pairs':[ [0, 2],
[3, 4],
[2, 1]]
},
}
all_data = []
for k, v in sampleJson.items():
    texts, values, pairs = v['text'], v['values'], v['pairs']
    for t, val, p in zip(texts, values, pairs):
        all_data.append({
            'record': k,
            'text': ' '.join(t),
            'pairs': p,
            **{'val_{}'.format(i): val_ for i, val_ in enumerate(val, 1)}
        })
df = pd.DataFrame(all_data)
print(df)
Prints this dataframe:
record text pairs val_1 val_2 val_3 val_4
0 record1 A fried is a nice companion . [0, 2] 0 1 0 0
1 record1 The birds are flying . [2, 1] 1 1 0 1
2 record2 We can work hard together . [0, 2] 0 1 0 0
3 record2 Let the things happen . [3, 4] 0 1 1 1
4 record2 There is always a way out . [2, 1] 1 1 0 1
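One detail of the desired CSV that the answer does not cover: the record name appears only on its first row. A sketch of blanking out the repeats before writing, assuming a frame shaped like the answer's df:

```python
import pandas as pd

# Minimal frame mirroring the shape of the answer's df
df = pd.DataFrame({'record': ['record1', 'record1', 'record2'],
                   'text': ['A fried is a nice companion .',
                            'The birds are flying .',
                            'We can work hard together .']})

# duplicated() marks every repeat of a record name; mask() replaces them with ''
df['record'] = df['record'].mask(df['record'].duplicated(), '')
df.to_csv('outPutFile.csv', index=False, encoding='utf-8')
```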
I have a large (100 MB) Excel file with multiple worksheets.
sheet A
id | name | address
1 | joe | A
2 | gis | B
3 | leo | C
work_1
id| call
1 | 10
1 | 8
2 | 1
3 | 3
work_2
id| call
2 | 4
3 | 8
3 | 7
Desired JSON for each id:
data = { id: 1,
address: A,
name: Joe,
log : [{call:10}, {call:8 }]
}
data= { id: 2,
address: B,
name: Gis,
log : [{call:1}, {call:4}]
}
data= { id: 3,
address: C,
name: Leo,
log : [{call:3}, {call:8}, {call:7}]
}
I've tried pandas, but just the read_excel call takes 5 minutes, without any processing. Is there any way to make it faster, and how do I get the desired JSON?
Maybe divide the process into chunks (but pandas removed chunksize for read_excel) and add some threading, so progress could be printed for each batch.
You can do:
works = pd.concat([work1, work2], ignore_index=True)  # work1/work2: the work_1/work_2 sheets
mapper_works = works.groupby('id')[['call']].apply(lambda x: x.to_dict('records'))
dfa['log'] = dfa['id'].map(mapper_works)  # dfa: "sheet A"
data = dfa.reindex(columns=['id', 'address', 'name', 'log']).to_dict('records')
print(data)
The output is a list of dicts, one per id:
[{'id': 1, 'address': 'A', 'name': 'joe', 'log': [{'call': 10}, {'call': 8}]},
{'id': 2, 'address': 'B', 'name': 'gis', 'log': [{'call': 1}, {'call': 4}]},
{'id': 3, 'address': 'C', 'name': 'leo', 'log': [{'call': 3}, {'call': 8}, {'call': 7}]}
]
If you want, you can assign it to a column:
dfa['dicts']=data
print(dfa)
id name address log \
0 1 joe A [{'call': 10}, {'call': 8}]
1 2 gis B [{'call': 1}, {'call': 4}]
2 3 leo C [{'call': 3}, {'call': 8}, {'call': 7}]
dicts
0 {'id': 1, 'address': 'A', 'name': 'joe', 'log'...
1 {'id': 2, 'address': 'B', 'name': 'gis', 'log'...
2 {'id': 3, 'address': 'C', 'name': 'leo', 'log'...
Given the following two arrays of dictionaries, how can I merge them such that the resulting array contains, for each id, only the dictionary with the greatest version?
data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]
The merged result should look like this:
data3 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1},
{'id': 6, 'name': 'Six', 'version': 2}]
How about the following:
//define data1 and data2
var data1 = new[]{
new {id = 1, name = "Oneeee", version = 2},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Three", version = 2},
new {id = 4, name = "Four", version = 1},
new {id = 5, name ="Five", version = 1}
};
var data2 = new[] {
new {id = 1, name = "One", version = 1},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Threeee", version = 3},
new {id = 6, name = "Six", version = 2}
};
//create a dictionary to handle lookups
var dict1 = data1.ToDictionary (k => k.id);
var dict2 = data2.ToDictionary (k => k.id);
// now query the data
var q = from k in dict1.Keys.Union(dict2.Keys)
select
dict1.ContainsKey(k) ?
(
dict2.ContainsKey(k) ?
(
dict1[k].version > dict2[k].version ? dict1[k] : dict2[k]
) :
dict1[k]
) :
dict2[k];
// convert enumerable back to array
var result = q.ToArray();
Alternative solution that is database-friendly if data1 and data2 are tables:
var q = (
from d1 in data1
join d2 in data2 on d1.id equals d2.id into data2j
from d2j in data2j.DefaultIfEmpty()
where d2j == null || d1.version >= d2j.version
select d1
).Union(
from d2 in data2
join d1 in data1 on d2.id equals d1.id into data1j
from d1j in data1j.DefaultIfEmpty()
where d1j == null || d2.version > d1j.version
select d2
);
var result = q.ToArray();
Assume you have some class like (JSON.NET attributes used here):
public class Data
{
[JsonProperty("id")]
public int Id { get; set; }
[JsonProperty("name")]
public string Name { get; set; }
[JsonProperty("version")]
public int Version { get; set; }
}
You can parse your JSON into arrays of these Data objects (note the @ verbatim string literal):
var str1 = @"[{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]";
var str2 = @"[{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]";
var data1 = JsonConvert.DeserializeObject<Data[]>(str1);
var data2 = JsonConvert.DeserializeObject<Data[]>(str2);
So you can concat these two arrays, group the items by id, and select from each group the item with the highest version:
var data3 = data1.Concat(data2)
.GroupBy(d => d.Id)
.Select(g => g.OrderByDescending(d => d.Version).First())
.ToArray(); // or ToDictionary(d => d.Id)
Result (serialized back to JSON):
[
{ "id": 1, "name": "Oneeee", "version": 2 },
{ "id": 2, "name": "Two", "version": 1 },
{ "id": 3, "name": "Threeee", "version": 3 },
{ "id": 4, "name": "Four", "version": 1 },
{ "id": 5, "name": "Five", "version": 1 },
{ "id": 6, "name": "Six", "version": 2 }
]
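Since the question's literals are Python dicts, here is the same keep-the-highest-version idea as a minimal Python sketch:

```python
data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Three', 'version': 2},
         {'id': 4, 'name': 'Four', 'version': 1},
         {'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Threeee', 'version': 3},
         {'id': 6, 'name': 'Six', 'version': 2}]

# Key each dict by id, keeping whichever entry has the higher version
merged = {}
for d in data1 + data2:
    cur = merged.get(d['id'])
    if cur is None or d['version'] > cur['version']:
        merged[d['id']] = d
data3 = sorted(merged.values(), key=lambda d: d['id'])
```

On ties, the first entry seen (from data1) wins, which matches the expected result here.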