Pandas - json normalize inside dataframe

I want to break down a column in a dataframe into multiple columns.
I have a dataframe with the following configuration:
GroupId,SubGroups,Type,Name
-4781505553015217258,"{'GroupId': -732592932641342965, 'SubGroups': [], 'Type': 'DefaultSite', 'Name': 'Default Site'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': 8123255835936628631, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'MERCEDES BENZ'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': -1785570219922840611, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'VOLVO'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': -3670461095557699088, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'SCANIA'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': 8683757391859854416, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'DRIVERS'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': -8066654520755643389, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'X - DECOMMISSION'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': 4177323092254043025, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'X-INSTALLATION'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': -6088426161802844604, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'FORD'}",OrganisationGroup,CompanyXYZ
-4781505553015217258,"{'GroupId': 8512440039365422841, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'HEAVY VEHICLES'}",OrganisationGroup,CompanyXYZ
I want to create a new dataframe where the SubGroups column is broken into its components. Note that the names taken from inside the SubGroups column are prefixed with SubGroup_:
GroupId, SubGroup_GroupId, SubGroup_SubGroups, SubGroup_Type, SubGroup_Name, Type, Name
-4781505553015217258, -732592932641342965, [], 'DefaultSite', 'Default Site', OrganisationGroup, CompanyXYZ
-4781505553015217258, 8123255835936628631, [], 'SiteGroup', 'MERCEDES BENZ', OrganisationGroup, CompanyXYZ
I have tried the following code:
for row in AllSubGroupsDF.itertuples():
    newDF = newDF.append(pd.io.json.json_normalize(row.SubGroups))
But it returns
GroupId,SubGroups,Type,Name
-732592932641342965,[],DefaultSite,Default Site
8123255835936628631,[],SiteGroup,MERCEDES BENZ
-1785570219922840611,[],SiteGroup,VOLVO
-3670461095557699088,[],SiteGroup,SCANIA
8683757391859854416,[],SiteGroup,DRIVERS
-8066654520755643389,[],SiteGroup,X - DECOMMISSION
4177323092254043025,[],SiteGroup,X-INSTALLATION
-6088426161802844604,[],SiteGroup,FORD
8512440039365422841,[],SiteGroup,HEAVY VEHICLES
I would like to have it all end up in one dataframe but I'm not sure how. Please help?

You can try using the ast package to parse each string into a dict:
import pandas as pd
import ast
data = [[-4781505553015217258,"{'GroupId': -732592932641342965, 'SubGroups': [], 'Type': 'DefaultSite', 'Name': 'Default Site'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': 8123255835936628631, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'MERCEDES BENZ'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': -1785570219922840611, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'VOLVO'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': -3670461095557699088, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'SCANIA'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': 8683757391859854416, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'DRIVERS'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': -8066654520755643389, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'X - DECOMMISSION'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': 4177323092254043025, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'X-INSTALLATION'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': -6088426161802844604, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'FORD'}","OrganisationGroup","CompanyXYZ"],
[-4781505553015217258,"{'GroupId': 8512440039365422841, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'HEAVY VEHICLES'}","OrganisationGroup","CompanyXYZ"]]
df = pd.DataFrame(data,columns=["GroupId","SubGroups","Type","Name"])
df["SubGroup_GroupId"] = df["SubGroups"].map(lambda x: ast.literal_eval(x)["GroupId"])
df["SubGroup_SubGroups"] = df["SubGroups"].map(lambda x: ast.literal_eval(x)["SubGroups"])
df["SubGroup_Type"] = df["SubGroups"].map(lambda x: ast.literal_eval(x)["Type"])
df["SubGroup_Name"] = df["SubGroups"].map(lambda x: ast.literal_eval(x)["Name"])
df
Hope this helps!!
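As a possible alternative to writing one line per key, the parsed dicts can be expanded in a single step and joined back to the other columns. This is a minimal sketch that assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize (pd.io.json.json_normalize is the older spelling, and DataFrame.append was removed in pandas 2.0). Only two rows from the question are used, and the names parsed, expanded and result are just illustrative.

```python
import ast

import pandas as pd

# Two sample rows from the question; SubGroups holds a dict written as a string.
df = pd.DataFrame(
    {
        "GroupId": [-4781505553015217258, -4781505553015217258],
        "SubGroups": [
            "{'GroupId': -732592932641342965, 'SubGroups': [], 'Type': 'DefaultSite', 'Name': 'Default Site'}",
            "{'GroupId': 8123255835936628631, 'SubGroups': [], 'Type': 'SiteGroup', 'Name': 'MERCEDES BENZ'}",
        ],
        "Type": ["OrganisationGroup", "OrganisationGroup"],
        "Name": ["CompanyXYZ", "CompanyXYZ"],
    }
)

# Parse each string once, then expand the resulting dicts into columns.
parsed = df["SubGroups"].map(ast.literal_eval)
expanded = pd.json_normalize(parsed.tolist()).add_prefix("SubGroup_")

# Swap the original string column for the prefixed columns (rows align by position).
result = pd.concat([df.drop(columns="SubGroups"), expanded], axis=1)
print(result.columns.tolist())
```

Because the strings are parsed only once, this also avoids calling ast.literal_eval four times per row.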

Related

How can I get all of x values from what I think is JSON

Python version: 3.10
Running a function returns this:
[{'type': 1, 'components': [{'type': 2, 'style': 1, 'label': 'She/Her', 'custom_id': 'She/Her'}, {'style': 1, 'label': 'He/Him', 'custom_id': 'He/Him', 'type': 2}]}]
How can I get all values of 'custom_id' within what is returned? Thank you!
You can do it like so:
myList = [{'type': 1, 'components': [{'type': 2, 'style': 1, 'label': 'She/Her', 'custom_id': 'She/Her'}, {'style': 1, 'label': 'He/Him', 'custom_id': 'He/Him', 'type': 2}]}]
for user in list(myList[0]["components"]):
    print(user["custom_id"])
You can format your JSON here
https://jsonformatter.curiousconcept.com/
to see which list is nested in which :)
@infinity wrote something similar.
[
    {
        "type":1,
        "components":[
            {
                "type":2,
                "style":1,
                "label":"She/Her",
                "custom_id":"She/Her"
            },
            {
                "style":1,
                "label":"He/Him",
                "custom_id":"He/Him",
                "type":2
            }
        ]
    }
]
myList = [{'type': 1, 'components': [{'type': 2, 'style': 1, 'label': 'She/Her', 'custom_id': 'She/Her'}, {'style': 1, 'label': 'He/Him', 'custom_id': 'He/Him', 'type': 2}]}]
for user in myList[0]['components']:
    print(user['custom_id'])
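Both answers above only look at the first element of the outer list. If the function can return more than one top-level entry, a nested comprehension might be a convenient way to collect every 'custom_id' into a list. This is a small sketch using the data from the question; the variable names are illustrative.

```python
my_list = [{'type': 1, 'components': [
    {'type': 2, 'style': 1, 'label': 'She/Her', 'custom_id': 'She/Her'},
    {'style': 1, 'label': 'He/Him', 'custom_id': 'He/Him', 'type': 2},
]}]

# Walk every top-level entry, then every component inside it.
custom_ids = [
    component['custom_id']
    for row in my_list
    for component in row['components']
]
print(custom_ids)  # ['She/Her', 'He/Him']
```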

remove k,v from http response in python

I am new to Python and am struggling with removing a key and its value from the JSON returned by an HTTP request. When querying a task I get the following back.
data = requests.get(url,headers=hed).json()['data']
[{
    'gid': '12011553977',
    'due_on': None,
    'name': 'do something',
    'notes': 'blalbla',
    'projects': [{
        'gid': '120067502445',
        'name': 'Project1'
    }]
}, {
    'gid': '12002408815',
    'due_on': '2021-10-21',
    'name': 'Proposal',
    'notes': 'bla',
    'projects': [{
        'gid': '12314323523',
        'name': 'Project1'
    }, {
        'gid': '12314323523',
        'name': 'Project2'
    }, {
        'gid': '12314323523',
        'name': 'Project3'
    }]
}]
I am trying to remove 'gid' from all projects so projects look like this
'projects': [{
'name': 'Company'
}]
What is the best way to do this with python3?
You can use recursion to make a simpler function to handle all elements and sub-elements. I haven't done extensive testing, or included any error checking or exception handling; but this should be close to what you want:
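def rec_pop(top_level_list, key_to_pop='gid'):
    for item in top_level_list:
        # Skip list entries that are not dicts (e.g. plain strings).
        if not isinstance(item, dict):
            continue
        # The default of None avoids a KeyError when the key is absent.
        item.pop(key_to_pop, None)
        for v in item.values():
            if isinstance(v, list):
                # Pass the key along so nested calls remove the same key.
                rec_pop(v, key_to_pop)

# Call the recursive function; data is modified in place.
rec_pop(data)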
Result:
In [25]: data
Out[25]:
[{'due_on': None,
  'name': 'do something',
  'notes': 'blalbla',
  'projects': [{'name': 'Project1'}]},
 {'due_on': '2021-10-21',
  'name': 'Proposal',
  'notes': 'bla',
  'projects': [{'name': 'Project1'}, {'name': 'Project2'}, {'name': 'Project3'}]}]
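Note that the recursive version removes 'gid' at every level, including from the tasks themselves. If the goal is to drop it only from the entries under 'projects', as the question describes, two plain loops may be enough. This is a sketch on a shortened copy of the question's data, assuming every project entry is a dict.

```python
data = [
    {'gid': '12011553977', 'due_on': None, 'name': 'do something',
     'projects': [{'gid': '120067502445', 'name': 'Project1'}]},
    {'gid': '12002408815', 'due_on': '2021-10-21', 'name': 'Proposal',
     'projects': [{'gid': '12314323523', 'name': 'Project1'},
                  {'gid': '12314323523', 'name': 'Project2'}]},
]

for task in data:
    # .get() tolerates tasks that have no 'projects' key at all.
    for project in task.get('projects', []):
        # The default of None avoids a KeyError if 'gid' is already missing.
        project.pop('gid', None)

print(data[1]['projects'])  # [{'name': 'Project1'}, {'name': 'Project2'}]
```

The task-level 'gid' values are left untouched here.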

Pandas read one parameter from nested json

I have the following JSON file, and I would like to read only the "dataRecordId" values and store them in a df:
{'responseInformation': '20 metadata records in response.',
'metaDataResponse': [{'timestampFrom': '2020-10-07T10:19:07.7810000Z',
'timestampTo': '2020-10-07T23:59:59.9999990Z',
'component': {'type': '', 'id': '', 'name': '', 'comment': ''},
'resource': {'type': 'EQU', 'id': '6100380', 'name': '', 'comment': ''},
'processStep': {'type': '', 'id': '', 'name': '', 'comment': ''},
'context': '',
'dataRecords': [{'dataRecordId': '171533103',
'groupName': 'Process',
'sensorName': 'AutomaticProcessActive',
'profile': 'sd',
'type': 'Switch2Way',
'unit': 'state',
'returnType': 'timeSeries'}]},
{'timestampFrom': '2020-10-08T00:00:00.6540000Z',
'timestampTo': '2020-10-08T23:59:59.9999990Z',
'component': {'type': '', 'id': '', 'name': '', 'comment': ''},
'resource': {'type': 'EQU', 'id': '6100380', 'name': '', 'comment': ''},
'processStep': {'type': '', 'id': '', 'name': '', 'comment': ''},
'context': '',
'dataRecords': [{'dataRecordId': '171534669',
'groupName': 'Process',
'sensorName': 'AutomaticProcessActive',
'profile': 'sd',
'type': 'Switch2Way',
'unit': 'state',
'returnType': 'timeSeries'}]},
This is what I have done so far, but I have no idea how to go deeper into the structure in order to reach 'dataRecordId':
import json

with open('file_200826_201026.json') as json_file:
    data = json.load(json_file)

for p in data['metaDataResponse']:
    print(p['dataRecords'])
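Each 'dataRecords' value is itself a list of dicts, so one more loop level reaches 'dataRecordId'. The following is a minimal sketch on a trimmed-down copy of the structure shown above (only the keys needed for the example are kept); record_ids is an illustrative name.

```python
# Trimmed-down copy of the structure from the question.
data = {
    'responseInformation': '20 metadata records in response.',
    'metaDataResponse': [
        {'timestampFrom': '2020-10-07T10:19:07.7810000Z',
         'dataRecords': [{'dataRecordId': '171533103', 'groupName': 'Process'}]},
        {'timestampFrom': '2020-10-08T00:00:00.6540000Z',
         'dataRecords': [{'dataRecordId': '171534669', 'groupName': 'Process'}]},
    ],
}

# Outer loop: one entry per metadata record; inner loop: its data records.
record_ids = [
    record['dataRecordId']
    for entry in data['metaDataResponse']
    for record in entry['dataRecords']
]
print(record_ids)  # ['171533103', '171534669']
```

Passing the list to pd.DataFrame({'dataRecordId': record_ids}) would then give a one-column DataFrame.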

Python giving vague error when trying to parse JSON object

I'm trying to use the peopledata API at peopledatalabs.com to retrieve data. I am using the sample python code located at https://docs.peopledatalabs.com/docs/quickstart
which is:
import requests
API_KEY = "YOUR API KEY"  # replace with your API key
###
pdl_url = "https://api.peopledatalabs.com/v4/person?api_key={}&".format(API_KEY)
param_string = "name=sean thorne&company=peopledatalabs.com"
json_response = requests.get(pdl_url + param_string).json()
# OR
pdl_url = "https://api.peopledatalabs.com/v4/person"
params = {
    "api_key": API_KEY,
    "name": ["sean thorne"],
    "company": ["peopledatalabs.com"]
}
json_response = requests.get(pdl_url, params=params).json()
json_response returns:
{'status': 200,
'likelihood': 5,
'data': {'id': 'yj5RUCSORrirXf2sf3gR',
'skills': [{'name': 'social media'},
{'name': 'strategic partnerships'},
{'name': 'public speaking'},
{'name': 'sales'},
{'name': 'photoshop'},
{'name': 'networking'},
{'name': 'mobile marketing'},
{'name': 'start ups'},
{'name': 'business development'},
{'name': 'fundraising'},
{'name': 'seo'},
{'name': 'strategy'},
{'name': 'idea generation'},
{'name': 'enterprise technology sales'},
{'name': 'entrepreneurship'},
{'name': 'social networking'},
{'name': 'creative strategy'},
{'name': 'time management'},
{'name': 'product management'},
{'name': 'social media marketing'},
{'name': 'css'},
{'name': 'https'},
{'name': 'saas'},
{'name': 'management'},
{'name': 'project management'},
{'name': 'public relations'},
{'name': 'marketing communications'},
{'name': 'sales/marketing and strategic partnerships'},
{'name': 'marketing strategy'},
{'name': 'mobile devices'},
{'name': 'installation'},
{'name': 'company culture'},
{'name': 'strategic vision'},
{'name': 'html5'},
{'name': 'hiring'}],
'industries': [{'name': 'computer software', 'is_primary': True}],
'interests': [{'name': 'location based services'},
{'name': 'mobile'},
{'name': 'social media'},
{'name': 'colleges'},
{'name': 'university students'},
{'name': 'consumer internet'},
{'name': 'college campuses'}],
'profiles': [{'network': 'linkedin',
'ids': ['145991517'],
'clean': 'linkedin.com/in/seanthorne',
'aliases': [],
'username': 'seanthorne',
'is_primary': True,
'url': 'http://www.linkedin.com/in/seanthorne'},
{'network': 'linkedin',
'ids': [],
'clean': 'linkedin.com/in/sean-thorne-9b9a8540',
'aliases': ['linkedin.com/pub/sean-thorne/40/a85/9b9'],
'username': 'sean-thorne-9b9a8540',
'is_primary': False,
'url': 'http://www.linkedin.com/in/sean-thorne-9b9a8540'},
{'network': 'twitter',
'ids': [],
'clean': 'twitter.com/seanthorne5',
'aliases': [],
'username': 'seanthorne5',
'url': 'http://www.twitter.com/seanthorne5'},
{'network': 'angellist',
'ids': [],
'clean': 'angel.co/475041',
'aliases': [],
'username': '475041',
'url': 'http://www.angel.co/475041'}],
'emails': [{'address': 'sthorne@uoregon.edu',
'type': None,
'sha256': 'e206e6cd7fa5f9499fd6d2d943dcf7d9c1469bad351061483f5ce7181663b8d4',
'domain': 'uoregon.edu',
'local': 'sthorne'},
{'address': 'sean@peopledatalabs.com',
'type': 'current_professional',
'sha256': '138ea1a7076bb01889af2309de02e8b826c27f022b21ea8cf11aca9285d5a04e',
'domain': 'peopledatalabs.com',
'local': 'sean'}],
'phone_numbers': [{'E164': '+14155688415',
'number': '+14155688415',
'type': None,
'country_code': '1',
'national_number': '4155688415',
'area_code': '415'}],
'birth_date_fuzzy': '1990',
'birth_date': None,
'gender': 'male',
'primary': {'job': {'company': {'name': 'people data labs',
'founded': '2015',
'industry': 'information technology and services',
'location': {'locality': 'san francisco',
'region': 'california',
'country': 'united states'},
'profiles': ['linkedin.com/company/peopledatalabs',
'linkedin.com/company/1640694639'],
'website': 'peopledatalabs.com',
'size': '11-50'},
'locations': [],
'end_date': None,
'start_date': '2015-03',
'title': {'levels': ['owner'],
'name': 'co-founder',
'functions': ['co founder']},
'last_updated': '2019-05-01'},
'location': {'name': 'san francisco, california, united states',
'locality': 'san francisco',
'region': 'california',
'country': 'united states',
'last_updated': '2019-01-01',
'continent': 'north america'},
'name': {'first_name': 'sean',
'middle_name': None,
'last_name': 'thorne',
'clean': 'sean thorne'},
'industry': 'computer software',
'personal_emails': [],
'linkedin': 'linkedin.com/in/seanthorne',
'work_emails': ['sean@peopledatalabs.com'],
'other_emails': ['sthorne@uoregon.edu']},
'names': [{'first_name': 'sean',
'last_name': 'thorne',
'suffix': None,
'middle_name': None,
'middle_initial': None,
'name': 'sean thorne',
'clean': 'sean thorne',
'is_primary': True}],
'locations': [{'name': 'san francisco, california, united states',
'locality': 'san francisco',
'region': 'california',
'subregion': 'city and county of san francisco',
'country': 'united states',
'continent': 'north america',
'type': 'locality',
'geo': '37.77,-122.41',
'postal_code': None,
'zip_plus_4': None,
'street_address': None,
'address_line_2': None,
'most_recent': True,
'is_primary': True,
'last_updated': '2019-01-01'}],
'experience': [{'company': {'name': 'hallspot',
'size': '1-10',
'founded': '2013',
'industry': 'computer software',
'location': {'locality': 'portland',
'region': 'oregon',
'country': 'united states'},
'profiles': ['linkedin.com/company/hallspot',
'twitter.com/hallspot',
'crunchbase.com/organization/hallspot',
'linkedin.com/company/3019184'],
'website': 'hallspot.com'},
'locations': [],
'end_date': '2015-02',
'start_date': '2012-08',
'title': {'levels': ['owner'],
'name': 'co-founder',
'functions': ['co founder']},
'type': None,
'is_primary': False,
'most_recent': False,
'last_updated': None},
{'company': {'name': 'people data labs',
'size': '11-50',
'founded': '2015',
'industry': 'information technology and services',
'location': {'locality': 'san francisco',
'region': 'california',
'country': 'united states'},
'profiles': ['linkedin.com/company/peopledatalabs',
'linkedin.com/company/1640694639'],
'website': 'peopledatalabs.com'},
'locations': [],
'end_date': None,
'start_date': '2015-03',
'title': {'levels': ['owner'],
'name': 'co-founder',
'functions': ['co founder']},
'type': None,
'is_primary': True,
'most_recent': True,
'last_updated': '2019-05-01'}],
'education': [{'school': {'name': 'university of oregon',
'type': 'post-secondary institution',
'location': 'eugene, oregon, united states',
'profiles': ['linkedin.com/edu/university-of-oregon-19207',
'facebook.com/universityoforegon',
'twitter.com/uoregon'],
'website': 'uoregon.edu'},
'end_date': '2014',
'start_date': '2010',
'gpa': None,
'degrees': [],
'majors': ['entrepreneurship'],
'minors': [],
'locations': []}]},
'dataset_version': '7.3'}
While trying to get the phone_numbers field, I have tried:
print(json_response["phone_numbers"])
and got the error code:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-132-2acb0f9f59c5> in <module>()
----> 1 json_response["phone_numbers"]
KeyError: 'phone_numbers'
I am hoping to get the number '+14155688415' as my result
'phone_numbers' is not a top-level key; it is nested inside 'data', so index 'data' first:
print(json_response["data"]["phone_numbers"])
When dealing with lots of data like that, JSONLint is a good resource to stay organized.
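The value under 'phone_numbers' is a list of dicts, so getting the bare string '+14155688415' takes one more index and one more key. This is a short sketch on a cut-down copy of the response shown in the question.

```python
# Cut-down copy of the response from the question.
json_response = {
    'status': 200,
    'data': {
        'phone_numbers': [{'E164': '+14155688415',
                           'number': '+14155688415',
                           'type': None}],
    },
}

# Collect every number; the list may hold several entries or none.
numbers = [entry['number'] for entry in json_response['data']['phone_numbers']]

# Take the first one only if the list is not empty.
first_number = numbers[0] if numbers else None
print(first_number)  # +14155688415
```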

Strings getting converted to null when writing JSON representation of RDD

I am trying to write an RDD which is structured like
(int , ListofList , ListofListofList)
Something like this
(49807360, [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , [111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , 111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123],[111206019,'ABC','XYZ:RDC' , 'RDC' , 123]])
When I print this in RDD form I see the data correctly. When I use the built-in library to write it in JSON format, I get null values in place of the strings.
{"user":49807360,"history":[[111206019,null,null,null,123], [111206019,null,null,null,123]],"collection":...}
The line of code I am using to serialize RDD to JSON is
rdd.toDF().toJSON().saveAsTextFile(ouput_file_path)
I have also tried
rdd.toDF().write.json(ouput_file_path,"overwrite","gzip")
The above code was run on Spark version 2.0.0.
This happens because you use DataFrame as an intermediate step. Spark SQL doesn't support heterogeneous arrays, so values which don't match inferred type (array<bigint>) are replaced by NULL.
If you really want to go this way and support heterogeneous structures, you should either use tuples, which are mapped to Spark SQL structs, or skip schema inference and provide the desired schema explicitly:
schema = ... # type: StructType
spark.createDataFrame(rdd, schema)
with schema (JSON representation) similar to:
{'fields': [{'metadata': {}, 'name': '_1', 'nullable': True, 'type': 'long'},
{'metadata': {},
'name': '_2',
'nullable': True,
'type': {'containsNull': True,
'elementType': {'fields': [{'metadata': {},
'name': '_1',
'nullable': True,
'type': 'long'},
{'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
'type': 'struct'},
'type': 'array'}},
{'metadata': {},
'name': '_3',
'nullable': True,
'type': {'fields': [{'metadata': {},
'name': '_1',
'nullable': True,
'type': {'fields': [{'metadata': {},
'name': '_1',
'nullable': True,
'type': 'long'},
{'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
'type': 'struct'}},
{'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'long'},
{'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_6', 'nullable': True, 'type': 'long'}],
'type': 'struct'}},
{'metadata': {},
'name': '_4',
'nullable': True,
'type': {'containsNull': True,
'elementType': {'fields': [{'metadata': {},
'name': '_1',
'nullable': True,
'type': 'long'},
{'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
{'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
'type': 'struct'},
'type': 'array'}}],
'type': 'struct'}
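The first suggestion, using tuples so that each heterogeneous row is treated as a struct rather than an array, could be prepared with a plain Python helper before calling toDF(). This is only a sketch under assumptions: the helper names rows_to_tuples and record_to_structs are hypothetical, the sample record is a simplified two-field version of the one in the question, and the Spark call is shown only in a comment because it is not run here.

```python
def rows_to_tuples(value):
    """Turn lists of scalars into tuples; lists that contain rows stay lists."""
    if isinstance(value, list):
        converted = [rows_to_tuples(v) for v in value]
        # A non-empty list with no nested rows is treated as one record.
        if converted and not any(isinstance(v, (list, tuple)) for v in converted):
            return tuple(converted)
        return converted
    return value


def record_to_structs(record):
    """Apply the conversion to every field of one RDD record."""
    return tuple(rows_to_tuples(field) for field in record)


record = (49807360, [[111206019, 'ABC', 'XYZ:RDC', 'RDC', 123],
                     [111206019, 'ABC', 'XYZ:RDC', 'RDC', 123]])
converted = record_to_structs(record)
print(converted)

# Assumed usage with an existing RDD (not executed in this sketch):
# rdd.map(record_to_structs).toDF()
```

The outer list stays a list, so it would still be inferred as an array, while each inner row becomes a tuple whose string and integer fields can keep their own types.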