Read a large Excel file with multiple worksheets to JSON with Python

I have a large Excel file (about 100 MB) with multiple worksheets:
sheet A
id | name | address
1  | joe  | A
2  | gis  | B
3  | leo  | C
work_1
id | call
1  | 10
1  | 8
2  | 1
3  | 3
work_2
id | call
2  | 4
3  | 8
3  | 7
Desired JSON for each id:
data = {id: 1,
        address: A,
        name: Joe,
        log: [{call: 10}, {call: 8}]
}
data = {id: 2,
        address: B,
        name: Gis,
        log: [{call: 1}, {call: 4}]
}
data = {id: 3,
        address: C,
        name: Leo,
        log: [{call: 3}, {call: 8}, {call: 7}]
}
I've tried pandas, but it takes 5 minutes just to run read_excel, without any processing. Is there any solution to make it faster, and how do I get the desired JSON?
Maybe divide the process into chunks (but pandas removed chunksize for read_excel) and add some threading, so progress could be printed for each batch.

You can concatenate the two work sheets, group the calls per id, and map the result onto sheet A (this assumes work1, work2 and dfa already hold the three sheets as DataFrames):
works = pd.concat([work1, work2], ignore_index=True)
mapper_works = works.groupby('id')[['call']].apply(lambda x: x.to_dict('records'))
dfa['log'] = dfa['id'].map(mapper_works)
data = dfa.reindex(columns=['id', 'address', 'name', 'log']).to_dict('records')
print(data)
The output is a list with one dict per id:
[{'id': 1, 'address': 'A', 'name': 'joe', 'log': [{'call': 10}, {'call': 8}]},
{'id': 2, 'address': 'B', 'name': 'gis', 'log': [{'call': 1}, {'call': 4}]},
{'id': 3, 'address': 'C', 'name': 'leo', 'log': [{'call': 3}, {'call': 8}, {'call': 7}]}
]
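For a self-contained check, the same pipeline can be reproduced with small in-memory frames standing in for the three worksheets (the names dfa, work1 and work2 mirror the answer above):

```python
import pandas as pd

# small sample frames standing in for the Excel sheets
dfa = pd.DataFrame({'id': [1, 2, 3],
                    'name': ['joe', 'gis', 'leo'],
                    'address': ['A', 'B', 'C']})
work1 = pd.DataFrame({'id': [1, 1, 2, 3], 'call': [10, 8, 1, 3]})
work2 = pd.DataFrame({'id': [2, 3, 3], 'call': [4, 8, 7]})

# concatenate the work sheets, collect the calls per id, map onto sheet A
works = pd.concat([work1, work2], ignore_index=True)
mapper_works = works.groupby('id')[['call']].apply(lambda x: x.to_dict('records'))
dfa['log'] = dfa['id'].map(mapper_works)
data = dfa.reindex(columns=['id', 'address', 'name', 'log']).to_dict('records')
```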
If you want, you can assign the dicts to a column:
dfa['dicts'] = data
print(dfa)
id name address log \
0 1 joe A [{'call': 10}, {'call': 8}]
1 2 gis B [{'call': 1}, {'call': 4}]
2 3 leo C [{'call': 3}, {'call': 8}, {'call': 7}]
dicts
0 {'id': 1, 'address': 'A', 'name': 'joe', 'log'...
1 {'id': 2, 'address': 'B', 'name': 'gis', 'log'...
2 {'id': 3, 'address': 'C', 'name': 'leo', 'log'...

Related

How to extract dict columns with similar names from json array in pandas dataframe

I have a dataframe of 20k rows x 45 columns that has been normalized nearly fully, but I have one pesky column in particular.
I have copied just the index and the problem column, omitting the other 44 columns for simplicity in data display.
agencies
0 [{'id': 29, 'name': 'Air Force, Dept of'}, {'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'}, {'id': 1, 'name': 'SENATE'}]
1 [{'id': 29, 'name': 'Air Force, Dept of'}, {'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'}, {'id': 1, 'name': 'SENATE'}]
2 [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'}, {'id': 1, 'name': 'SENATE'}]
3 [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'}, {'id': 1, 'name': 'SENATE'}]
4 [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'}, {'id': 1, 'name': 'SENATE'}]
Here, I would like to extract each of the values under the name key. However, they all have the same key, so json_normalize() puts them all in the same column and lengthens the dataset by however many entries are in each array.
I would like to extract them into name_1, name_2, name_3, ... , name_max_amount_of_names. So let's suppose the max amount of name entries in the column is 5, I would like to have:
name_1, name_2, name_3, name_4, name_5.
I have tried normalization and cannot figure this out further.
Thank you in advance.
EDIT:
Thanks to the kind commenter below, I'm close. However, it seems to be creating a new column for each unique 'name', which is not what I was trying to accomplish, as it clutters the data with many NaNs.
I have included a screenshot of the results.
Try as follows.
We apply Series.explode to get each item from each list on a separate row (but still with the appropriate index number).
We wrap this result inside pd.json_normalize to get a flat table.
We now need to set a new index (with apply(pd.Series) we wouldn't have this problem) with the exploded index values (so: .set_index(df.agencies.explode().index)).
Finally, we use df.pivot to get the data in the correct shape.
Now, we are basically done, except for renaming the df.columns.
import pandas as pd

data = {'agencies':
        {0: [{'id': 29, 'name': 'Air Force, Dept of'},
             {'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'},
             {'id': 1, 'name': 'SENATE'}],
         1: [{'id': 29, 'name': 'Air Force, Dept of'},
             {'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'},
             {'id': 1, 'name': 'SENATE'}],
         2: [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'},
             {'id': 1, 'name': 'SENATE'}],
         3: [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'},
             {'id': 1, 'name': 'SENATE'}],
         4: [{'id': 2, 'name': 'HOUSE OF REPRESENTATIVES'},
             {'id': 1, 'name': 'SENATE'}]}}
df = pd.DataFrame(data)
df_names = pd.json_normalize(df.agencies.explode())\
    .set_index(df.agencies.explode().index)\
    .pivot(index=None, columns='id', values='name')
# order of column names will be:
# sorted(pd.json_normalize(df.agencies.explode())\
#        .set_index(df.agencies.explode().index)['id'].unique())
# i.e.: [1, 2, 29]
# (reorder them as appropriate, and then) overwrite as name_1, name_2, name_3
df_names.columns = [f'name_{idx}' for idx in range(1, len(df_names.columns)+1)]
print(df_names)
name_1 name_2 name_3
0 SENATE HOUSE OF REPRESENTATIVES Air Force, Dept of
1 SENATE HOUSE OF REPRESENTATIVES Air Force, Dept of
2 SENATE HOUSE OF REPRESENTATIVES NaN
3 SENATE HOUSE OF REPRESENTATIVES NaN
4 SENATE HOUSE OF REPRESENTATIVES NaN
# assignment to orig df would be:
# df = pd.concat([df,df_names],axis=1)
Update
The OP has updated the question. Let's produce a small example to clarify the apparent problem. The adjusted data is as follows:
import pandas as pd

data = {'agencies':
        {0: [{'id': 29, 'name': 'Air Force, Dept of 29'},
             {'id': 1, 'name': 'SENATE'},
             {'id': 4, 'name': 'Air Force, Dept of 4'}],
         1: [{'id': 2, 'name': 'Air Force, Dept of 2'},
             {'id': 1, 'name': 'SENATE'}]}}
So, here we have 3 unmatched key-value pairs: 'id': 2, 4, and 29. Applying the method described above, we will end up with this:
name_1 name_2 name_3 name_4
0 SENATE NaN Air Force, Dept of 4 Air Force, Dept of 29
1 SENATE Air Force, Dept of 2 NaN NaN
Here, the names associated with id: 1 work fine (name_1), because this key is found in both lists of dicts. However, the other name keys all lack a "match" in the other list, so each ends up with its own column, in consecutive order based on the ids. I.e. name_2 holds names associated with 'id': 2, then name_3 for 4, and name_4 for 29.
If I understand the update correctly, the OP rather wishes to "use up" each new consecutive name column with name-keys as much as possible, before creating a new column. I.e., in the current example, this would mean that name_2 is to be filled with the name for 'id': 4 in row 0, and 'id': 2 in row 1. And then only the name for 'id': 29 will get its own column (name_3), since name_2 is already "full". We can achieve this quite easily by adding an intermediate step:
df = pd.DataFrame(data)
first = pd.json_normalize(df.agencies.explode())
second = first.set_index(df.agencies.explode().index)
# rank all `ids` per group, and overwrite the original `ids`
# i.e. [1, 4, 29] -> [1, 2, 3]
second['id'] = second.groupby(level=0)['id'].rank()
final = second.pivot(index=None, columns='id', values='name')
final.columns = [f'name_{idx}' for idx in range(1, len(final.columns)+1)]
print(final)
name_1 name_2 name_3
0 SENATE Air Force, Dept of 4 Air Force, Dept of 29
1 SENATE Air Force, Dept of 2 NaN
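A shorter route to the same "compact" layout, if filling columns in list order (rather than by ranked id) is acceptable, is to number each dict within its row with cumcount and pivot on that position. This is a sketch of an alternative, not what the answer above uses:

```python
import pandas as pd

data = {'agencies':
        {0: [{'id': 29, 'name': 'Air Force, Dept of 29'},
             {'id': 1, 'name': 'SENATE'},
             {'id': 4, 'name': 'Air Force, Dept of 4'}],
         1: [{'id': 2, 'name': 'Air Force, Dept of 2'},
             {'id': 1, 'name': 'SENATE'}]}}
df = pd.DataFrame(data)

# one row per dict, keeping the original row index
names = df['agencies'].explode().apply(lambda d: d['name']).to_frame('name')
# position of each dict within its original row: 1, 2, 3, ...
names['pos'] = names.groupby(level=0).cumcount() + 1
wide = names.pivot(columns='pos', values='name')
wide.columns = [f'name_{i}' for i in wide.columns]
```

Note that the columns are filled in the order the dicts appear in each list, so the first column here holds 'Air Force, Dept of 29' for row 0, not 'SENATE'.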

Creating a pandas data frame from a JSON object

Objective: I have fetched insights data from my Instagram account using the Instagram Graph API. Below is the JSON object audience_insights['data']:
[{'name': 'audience_city',
'period': 'lifetime',
'values': [{'value': {'London, England': 1,
'Kharkiv, Kharkiv Oblast': 1,
'Jamui, Bihar': 1,
'Burdwan, West Bengal': 1,
'Kolkata, West Bengal': 112,
'Dhulian, West Bengal': 1,
'Argonne, Wisconsin': 1,
'College Park, Georgia': 1,
'Pakaur, Jharkhand': 1,
'Bristol, England': 1,
'Delhi, Delhi': 1,
'Gaya, Bihar': 1,
'Howrah, West Bengal': 1,
'Kanpur, Uttar Pradesh': 1,
'Jaipur, Rajasthan': 2,
'Panipat, Haryana': 1,
'Saint Etienne, Rhône-Alpes': 1,
'Panagarh, West Bengal': 1,
'Bhagalpur, Bihar': 1,
'Frankfurt, Hessen': 1,
'Riyadh, Riyadh Region': 1,
'Roorkee, Uttarakhand': 1,
'Harinavi, West Bengal': 1,
'Secunderabad, Telangana': 1,
'Mumbai, Maharashtra': 3,
'Patna, Bihar': 11,
'Obando, Valle del Cauca': 1,
'Jaunpur, Uttar Pradesh': 1,
'Sitamau, Madhya Pradesh': 1},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Audience City',
'description': "The cities of this profile's followers",
'id': '17841406112341342/insights/audience_city/lifetime'},
{'name': 'audience_country',
'period': 'lifetime',
'values': [{'value': {'DE': 1,
'IN': 144,
'GB': 2,
'UA': 1,
'FR': 1,
'CO': 1,
'SA': 1,
'US': 2},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Audience Country',
'description': "The countries of this profile's followers",
'id': '17841406112341342/insights/audience_country/lifetime'},
{'name': 'audience_gender_age',
'period': 'lifetime',
'values': [{'value': {'F.13-17': 1,
'F.18-24': 20,
'F.25-34': 5,
'M.13-17': 4,
'M.18-24': 79,
'M.25-34': 15,
'M.35-44': 1,
'M.45-54': 3,
'U.13-17': 4,
'U.18-24': 16,
'U.25-34': 2,
'U.45-54': 3},
'end_time': '2022-03-24T07:00:00+0000'}],
'title': 'Gender and Age',
'description': "The gender and age distribution of this profile's followers",
'id': '17841406112341342/insights/audience_gender_age/lifetime'}]
I wish to loop through this and create three data frames:
The first shows each location and its count.
| | Location | Count |
| ---- | -------------- | ----- |
| 0 | London, England | 1 |
Second would be a similar data frame with country and count.
And, finally the last would show the gender and category against the count.
So far, I've been able to extract the three dictionaries that I eventually need to convert to separate data frames. I've stored the dictionaries in a list all_data.
all_data = []
for item in audience_insights['data']:
    data = item['values'][0]['value']
    all_data.append(data)
df_location = pd.DataFrame(all_data)
all_data
[{'London, England': 1,
'Kharkiv, Kharkiv Oblast': 1,
'Jamui, Bihar': 1,
'Burdwan, West Bengal': 1,
'Kolkata, West Bengal': 112,
'Dhulian, West Bengal': 1,
'Argonne, Wisconsin': 1,
'College Park, Georgia': 1,
'Bristol, England': 1,
'Bikaner, Rajasthan': 1,
'Delhi, Delhi': 1,
'Gaya, Bihar': 1,
'Howrah, West Bengal': 1,
'Jaipur, Rajasthan': 1,
'Kanpur, Uttar Pradesh': 1,
'Panipat, Haryana': 1,
'Saint Etienne, Rhône-Alpes': 1,
'Panagarh, West Bengal': 1,
'Panchagan, Odisha': 1,
'Bhagalpur, Bihar': 1,
'Frankfurt, Hessen': 1,
'Riyadh, Riyadh Region': 1,
'Roorkee, Uttarakhand': 1,
'Harinavi, West Bengal': 1,
'Mumbai, Maharashtra': 3,
'Patna, Bihar': 11,
'Obando, Valle del Cauca': 1,
'Jaunpur, Uttar Pradesh': 1,
'Hyderabad, Telangana': 1,
'Sitamau, Madhya Pradesh': 1},
{'DE': 1, 'IN': 144, 'GB': 2, 'FR': 1, 'CO': 1, 'UA': 1, 'SA': 1, 'US': 2},
{'F.13-17': 1,
'F.18-24': 20,
'F.25-34': 5,
'M.13-17': 4,
'M.18-24': 79,
'M.25-34': 16,
'M.35-44': 1,
'M.45-54': 3,
'U.13-17': 4,
'U.18-24': 15,
'U.25-34': 2,
'U.45-54': 3}]
I want to be able to convert each of these dictionaries into a data frame such that the keys are in the first column and the values are in the second.
Thank you for your help!
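One way to finish the conversion: each dictionary's items() can be fed straight to the DataFrame constructor, which puts the keys in the first column and the values in the second. A minimal sketch, assuming all_data holds the three dictionaries extracted above (the data here is abbreviated and the column labels are illustrative):

```python
import pandas as pd

# abbreviated stand-ins for the three dictionaries in all_data
all_data = [
    {'London, England': 1, 'Kolkata, West Bengal': 112, 'Patna, Bihar': 11},
    {'DE': 1, 'IN': 144, 'GB': 2, 'US': 2},
    {'F.13-17': 1, 'M.18-24': 79, 'U.18-24': 15},
]

# key/value pairs become the two columns of each frame
labels = [('Location', 'Count'), ('Country', 'Count'), ('Gender.Age', 'Count')]
df_location, df_country, df_gender_age = (
    pd.DataFrame(list(d.items()), columns=cols)
    for d, cols in zip(all_data, labels)
)
```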

Access multiple dictionaries inside a list of a json file

I am trying to create a dataframe from JSON, but I cannot access multiple objects inside a list; only the first value is retrieved.
This is my JSON:
[{'id': '1', 'fnamae': 'Rasab', 'lname': 'Asdaf', 'Age': 21, 'Language': ['python', 'json'], 'parents': {'mother': {'name': 'Mrs. Mother', 'phone': '1212121212'}, 'father': {'name': 'Mr. Father', 'phone': '1212121212'}}, 'siblings': [{'name': 'jamuna', 'phone': 564851312}, {'name': 'Killana', 'phone': 1212121212}]}, {'id': '2', 'fnamae': 'Muddassir', 'lname': 'Jameel', 'Age': 25, 'Language': ['React', 'json'], 'parents': {'mother': {'name': 'Mrs. Mutherinlaw', 'phone': 9654512}, 'father': {'name': 'Mr. Futherinlaw', 'phone': 53154278}}, 'siblings': [{'name': 'Giallan', 'phone': 998742568}, {'name': 'Simba', 'phone': 12355875}]}, {'id': '3', 'fnamae': 'Farhan', 'lname': 'Akhtar', 'Age': 25, 'Language': ['Drupal', 'PHP'], 'parents': {'mother': {'name': 'Heung min son', 'phone': 89546487}, 'father': {'name': 'Kane', 'phone': 4564823545}}, 'siblings': [{'name': 'Xamcs', 'phone': 78654325}, {'name': 'sinfbad', 'phone': 45648232}]}]
And this is my code to access "siblings" list from the json files to create a dataframe.
s = df['siblings']  # df = pd.DataFrame(json_content), built from the JSON above
df2 = pd.DataFrame(s.str[0].values.tolist())
df2
But the output is:
name phone
0 jamuna 564851312
1 Giallan 998742568
2 Xamcs 78654325
My expected output would be to include the multiple names and phone numbers of the siblings.
name phone
0 [jamuna, Killana] [564851312, 1212121212]
1 [Giallan, Simba] [998742568, 12355875]
2 [Xamcs, sinfbad] [78654325, 45648232]
When I change my code to s.str[1], I am able to retrieve the second entry of each list. But how do I iterate over them?
You're going to have to do a nested list comprehension (json_content is the parsed list above):
import pandas as pd

pd.DataFrame(
    {
        key: [[j[key] for j in i["siblings"]] for i in json_content]
        for key in ["name", "phone"]
    }
)
This will give you
| | name | phone |
|---:|:----------------------|:------------------------|
| 0 | ['jamuna', 'Killana'] | [564851312, 1212121212] |
| 1 | ['Giallan', 'Simba'] | [998742568, 12355875] |
| 2 | ['Xamcs', 'sinfbad'] | [78654325, 45648232] |
Alternatively, a flat list comprehension derives one row per sibling (not grouped per person):
pd.DataFrame([d for l in json_content for d in l['siblings']])
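An equivalent pandas-native route, sketched here as an alternative: explode the siblings column, normalize the dicts, and aggregate each field back into lists per original row (json_content stands for the parsed list above, abbreviated here):

```python
import pandas as pd

# abbreviated version of the parsed JSON above
json_content = [
    {'id': '1', 'siblings': [{'name': 'jamuna', 'phone': 564851312},
                             {'name': 'Killana', 'phone': 1212121212}]},
    {'id': '2', 'siblings': [{'name': 'Giallan', 'phone': 998742568},
                             {'name': 'Simba', 'phone': 12355875}]},
    {'id': '3', 'siblings': [{'name': 'Xamcs', 'phone': 78654325},
                             {'name': 'sinfbad', 'phone': 45648232}]},
]

# one row per sibling dict, keeping the original row index
exploded = pd.DataFrame(json_content).explode('siblings')
flat = pd.json_normalize(exploded['siblings'].tolist()).set_index(exploded.index)
# gather each column back into a list per original row
df2 = flat.groupby(level=0).agg(list)
```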

pandas json_normalize flatten nested dictionaries

I am trying to flatten nested dictionaries by using json_normalize.
My data is like this:
data = [
{'gra': [
{
'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'
}
]}
]
I want to get a dataframe like this:
A B C.D C.E date
1 9 1 1 2019-06-27
2 1 1 2 2019-06-27
6 1 1 3 2019-06-27
I tried record_path and meta in json_normalize, but it keeps giving me an error.
How do you achieve this?
json_normalize does a pretty good job of flattening the objects into a pandas dataframe, once each inner 'gra' record is pulled out of its list:
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0
data_ = [item['gra'][0] for item in data]  # [{'A': 1, 'B': 9, 'C': {'D': '1', 'E': '1'}, 'date': '2019-06-27'}, ...]
print(json_normalize(data_))
output:
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27
Iterating over the list is the easiest way, though not necessarily the best. I hope it solves your problem:
data = [{'gra':[{'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'}]},
{'gra':[{'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'}]},
{'gra':[{'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'}]}
]
import pandas as pd

final_list = []
for i in data:
temp = dict()
temp['A'] = i['gra'][0]['A']
temp['B'] = i['gra'][0]['B']
temp['C.D'] = i['gra'][0]['C']['D']
temp['C.E'] = i['gra'][0]['C']['E']
temp['date']=i['gra'][0]['date']
final_list.append(temp)
df = pd.DataFrame.from_dict(final_list)
print(df)
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27
First we normalize, then reorder the columns to produce the required output:
import pandas as pd
data = [
{'gra': [
{
'A': 1,
'B': 9,
'C': {'D': '1', 'E': '1'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 2,
'B': 1,
'C': {'D': '1', 'E': '2'},
'date': '2019-06-27'
}
]},
{'gra': [
{
'A': 6,
'B': 1,
'C': {'D': '1', 'E': '3'},
'date': '2019-06-27'
}
]}
]
df = pd.json_normalize(data, 'gra')
cols = ['A','B','C.D','C.E','date']
df = df[cols]
print(df)
A B C.D C.E date
0 1 9 1 1 2019-06-27
1 2 1 1 2 2019-06-27
2 6 1 1 3 2019-06-27

Merging arrays of dictionaries

Given the following two arrays of dictionaries, how can I merge them such that the resulting array of dictionaries contains only those dictionaries whose version is greatest?
data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]
The merged result should look like this:
data3 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1},
{'id': 6, 'name': 'Six', 'version': 2}]
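Since the data in the question is written as plain Python literals, the merge can also be sketched in Python without any framework: keep a dict keyed by id and retain the record with the highest version (the function name merge_latest is illustrative; on equal versions the first-seen record wins, which matches id 2 above):

```python
def merge_latest(*datasets):
    """For each id, keep the record with the highest version seen."""
    best = {}
    for records in datasets:
        for rec in records:
            cur = best.get(rec['id'])
            if cur is None or rec['version'] > cur['version']:
                best[rec['id']] = rec
    return sorted(best.values(), key=lambda r: r['id'])

data1 = [{'id': 1, 'name': 'Oneeee', 'version': 2},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Three', 'version': 2},
         {'id': 4, 'name': 'Four', 'version': 1},
         {'id': 5, 'name': 'Five', 'version': 1}]
data2 = [{'id': 1, 'name': 'One', 'version': 1},
         {'id': 2, 'name': 'Two', 'version': 1},
         {'id': 3, 'name': 'Threeee', 'version': 3},
         {'id': 6, 'name': 'Six', 'version': 2}]

data3 = merge_latest(data1, data2)
```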
How about the following:
// define data1 and data2
var data1 = new[]{
new {id = 1, name = "Oneeee", version = 2},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Three", version = 2},
new {id = 4, name = "Four", version = 1},
new {id = 5, name ="Five", version = 1}
};
var data2 = new[] {
new {id = 1, name = "One", version = 1},
new {id = 2, name = "Two", version = 1},
new {id = 3, name = "Threeee", version = 3},
new {id = 6, name = "Six", version = 2}
};
//create a dictionary to handle lookups
var dict1 = data1.ToDictionary (k => k.id);
var dict2 = data2.ToDictionary (k => k.id);
// now query the data
var q = from k in dict1.Keys.Union(dict2.Keys)
select
dict1.ContainsKey(k) ?
(
dict2.ContainsKey(k) ?
(
dict1[k].version > dict2[k].version ? dict1[k] : dict2[k]
) :
dict1[k]
) :
dict2[k];
// convert enumerable back to array
var result = q.ToArray();
Alternative solution that is database friendly if data1 and data2 are tables:
var q = (
from d1 in data1
join d2 in data2 on d1.id equals d2.id into data2j
from d2j in data2j.DefaultIfEmpty()
where d2j == null || d1.version >= d2j.version
select d1
).Union(
from d2 in data2
join d1 in data1 on d2.id equals d1.id into data1j
from d1j in data1j.DefaultIfEmpty()
where d1j == null || d2.version > d1j.version
select d2
);
var result = q.ToArray();
Assume you have some class like (JSON.NET attributes used here):
public class Data
{
[JsonProperty("id")]
public int Id { get; set; }
[JsonProperty("name")]
public string Name { get; set; }
[JsonProperty("version")]
public int Version { get; set; }
}
You can parse you JSON to arrays of this Data objects:
var str1 = @"[{'id': 1, 'name': 'Oneeee', 'version': 2},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Three', 'version': 2},
{'id': 4, 'name': 'Four', 'version': 1},
{'id': 5, 'name': 'Five', 'version': 1}]";
var str2 = @"[{'id': 1, 'name': 'One', 'version': 1},
{'id': 2, 'name': 'Two', 'version': 1},
{'id': 3, 'name': 'Threeee', 'version': 3},
{'id': 6, 'name': 'Six', 'version': 2}]";
var data1 = JsonConvert.DeserializeObject<Data[]>(str1);
var data2 = JsonConvert.DeserializeObject<Data[]>(str2);
So, you can concat these two arrays, group data items by id and select from each group item with highest version:
var data3 = data1.Concat(data2)
.GroupBy(d => d.Id)
.Select(g => g.OrderByDescending(d => d.Version).First())
.ToArray(); // or ToDictionary(d => d.Id)
Result (serialized back to JSON):
[
{ "id": 1, "name": "Oneeee", "version": 2 },
{ "id": 2, "name": "Two", "version": 1 },
{ "id": 3, "name": "Threeee", "version": 3 },
{ "id": 4, "name": "Four", "version": 1 },
{ "id": 5, "name": "Five", "version": 1 },
{ "id": 6, "name": "Six", "version": 2 }
]