How do I json normalize this nested structure using Pandas? - json

I have a JSON url with the following structure from THIS url and am trying to get the name, price and volume from the structure below
{'data': {'1': {'id': 1,
'name': 'Bitcoin',
'symbol': 'BTC',
'website_slug': 'bitcoin',
'rank': 1,
'circulating_supply': 17115025.0,
'total_supply': 17115025.0,
'max_supply': 21000000.0,
'quotes': {'USD': {'price': 6317.68,
'volume_24h': 5034440000.0,
'market_cap': 108127251142.0,
'percent_change_1h': 0.22,
'percent_change_24h': 5.26,
'percent_change_7d': -4.37}},
'last_updated': 1529943576},
'2': {'id': 2,
'name': 'Litecoin',
'symbol': 'LTC',
'website_slug': 'litecoin',
'rank': 6,
'circulating_supply': 57133246.0,
'total_supply': 57133246.0,
'max_supply': 84000000.0,
'quotes': {'USD': {'price': 84.4893,
'volume_24h': 512241000.0,
'market_cap': 4827147957.0,
'percent_change_1h': 1.97,
'percent_change_24h': 8.96,
'percent_change_7d': -12.54}},
'last_updated': 1529943541}},
'metadata': {'timestamp': 1529943282,
'num_cryptocurrencies': 1586,
'error': None}}
I tried several variations to get each coin in a row but have failed so far
Attempt 1
df = pd.read_json('https://api.coinmarketcap.com/v2/ticker')
Attempt 2
data = requests.get('https://api.coinmarketcap.com/v2/ticker',params).json()
df = pd.DataFrame(data['data'])
df
Attempt 3
I found this function on stackoverflow called json normalize and I tried to use it but no luck so far
df = pd.io.json.json_normalize(data['data'])
df
Any suggestions on how to turn each coin into a row are super appreciated
UPDATE 1
params = {'start': 0, 'sort': 'id', 'limit': 100}
data = requests.get('https://api.coinmarketcap.com/v2/ticker', params).json()
df = pd.DataFrame(data['data'])
df = df.transpose()
df.set_index('id')
This is pretty close to what I want, but how do I get the volume and price out of quotes

assuming "quotes" only have 1 row and the key is "USD", I did this
df.drop('quotes', 1).join(
pd.DataFrame(
df.quotes.apply(
lambda x: {'USD'+'_'+key: val for key, val in x['USD'].items()}
).tolist()
)
)

Related

How to handle the variable size json file in python to create DataFrame using pandas

I am trying to build a DataFrame using pandas but I am not able to handle the case when I have the variable size of JSON chunks I am getting.
eg:
1st chunk:
{'ad': 0,
'country': 'US',
'ver': '1.0',
'adIdType': 2,
'adValue': '5',
'data': {'eventId': 99,
'clickId': '',
'eventType': 'PURCHASEMADE',
'tms': '2019-12-25T09:57:04+0000',
'productDetails': {'currency': 'DLR',
'productList': [
{'segment': 'Girls',
'vertical': 'Fashion Jewellery',
'brickname': 'Traditional Jewellery',
'price': 8,
'quantity': 10}]},
'transactionId': '1254'},
'appName': 'xer.tt',
'appId': 'XR',
'sdkVer': '1.0.0',
'language': 'en',
'tms': '2022-04-25T09:57:04+0000',
'tid': '124'}
2nd chunk:
{'ad': 0,
'country': 'US',
'ver': '1.0',
'adIdType': 2,
'adValue': '78',
'data': {'eventId': 7,
'clickId': '',
'eventType': 'PURCHASEMADE',
'tms': '20219-02-25T09:57:04+0000',
'productDetails': {'currency': 'DLR',
'productList': [{'segment': 'Boys',
'vertical': 'Fashion',
'brickname': 'Casuals',
'price': 10,
'quantity': 5},
{'segment': 'Girls',
'vertical': 'Fashion Jewellery',
'brickname': 'Traditional Jewellery',
'price': 8,
'quantity': 10}]},
'transactionId': '3258'},
'appName': 'xer.tt',
'appId': 'XR',
'sdkVer': '1.0.0',
'language': 'en',
'tms': '2029-02-25T09:57:04+0000',
'tid': '124'}
Now in the ProductDetails the number of products are getting changes, in the first chunk we have only 1 product listed and it's detailed but in the 2nd chunk, we have 2 products listed and it's detailed, for further chunks we can have ANY number of products for other chunks also. (i.e. chunks~Records)
I tried doing that by writing some python scripts but was not able to come to any good solution.
PS: If any further detail is required please let me know in the comments.
Thanks!
What you can do, is use pd.json_normalize and have the most "inner" dictionary as your record_path and all other data you are interested in as your meta . Here is an in-depth example how you could construct that: pandas.io.json.json_normalize with very nested json
In your case, that would for example be (for a single object):
df = pd.json_normalize(obj,
record_path=["data", "productDetails", "productList"],
meta=([
["data", "productDetails", "currency"],
["data", "transactionId"],
["data", "clickId"],
["data", "eventType"],
["data", "tms"],
"ad",
"country"
])
)

How to parse nested JSON file in Pandas

I'm trying to transform a JSON file generated by the Day One Journal to a text file using Python but hit a brick wall.
This is broadly the format:
{'metadata': {'version': '1.0'},
'entries': [{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"attributes":{"line":{"header":1,"identifier":"F78B28DA-488E-489E-9C95-1A0648099792"}},"text":"2022\\n"},{"attributes":{"line":{"header":0,"identifier":"FA8C6594-F43D-4652-B442-DAF72A379799"}},"text":"\\n"},{"attributes":{"line":{"header":0,"identifier":"0923BCC8-B24A-4C0D-963C-73D09561EECD"}},"text":"It’s the beginning of a new year"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]},{"text":"\\n\\n\\n\\n"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 2925.313938140869,
'timeZone': 'Europe/London',
'creationDeviceType': 'Hal 9000',
'uuid': '988D9D9876624FAEB88F9BCC666FD9CD',
'creationDeviceModel': 'MacBookPro15,2',
'starred': False,
'location': {'region': {'center': {'longitude': -0.0095,
'latitude': 51},
'radius': 75},
'localityName': 'London',
'country': 'United Kingdom',
'timeZoneName': 'Europe/London',
'administrativeArea': 'England',
'longitude': -0.0095,
'placeName': 'Somewhere',
'latitude': 51},
'isPinned': False,
'creationDevice': 'somedevice'...,
}
I only want the 'text' (of which there might be a number of 'text' entries and 'creationDate' so I've got a daily record.
My code to pull out the data is straightforward:
import json
# Opening JSON file
f = open('files/2022.json')
# returns JSON object as
# a dictionary
data = json.load(f)
# Closing file
f.close()
I've tried using list comprensions and then concatenating the Series in Pandas, but two don't match in length - because multiple entries on one day mix up the dataframe.
I wanted to use this code, but:
result = []
for i in data['entries']:
entry = i['creationDate'] + i['text']
result.append(entry)
but I get this error:
KeyError: 'text'
What do I need to do?
Update:
{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"text":"Later than I planned\\n"}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 672.3099998235703,
'timeZone': 'Europe/London',
'creationDeviceType': 'Computer',
'uuid': 'F53DCC5E05BB4106A49C76954117DBF4',
'creationDeviceModel': 'xompurwe',
'isPinned': False,
'creationDevice': 'Computer',
'text': 'Later than I planned \\\n',
'modifiedDate': '2022-01-05T01:01:29Z',
'isAllDay': False,
'creationDate': '2022-01-05T00:39:19Z',
'creationOSName': 'macOS'},
Sort of managed to work a solution - thank you to everyone who helped this morning, particularly #Tomer S.
My solution was:
result = []
for i in data['entries']:
print (i['creationDate'] + i['text'])
result.append(entry)
It still won't get what I want

How to create MultiIndex Dataframe from a nested dictionary (many levels)

I am using the pyflightdata library to search for flight stats. It returns json inside a list of dicts.
Here is an example of the first dictionary in the list after my query:
> flightlog = {'identification': {'number': {'default': 'KE504', 'alternative': 'None'}, 'callsign': 'KAL504', 'codeshare': 'None'}
, 'status': {'live': False, 'text': 'Landed 22:29', 'estimated': 'None', 'ambiguous': False, 'generic': {'status': {'text': 'landed', 'type': 'arrival', 'color': 'green', 'diverted': 'None'}
, 'eventTime': {'utc_millis': 1604611778000, 'utc_date': '20201105', 'utc_time': '2229', 'utc': 1604611778, 'local_millis': 1604615378000, 'local_date': '20201105', 'local_time': '2329', 'local': 1604615378}}}
, 'aircraft': {'model': {'code': 'B77L', 'text': 'Boeing 777-FEZ'}, 'registration': 'HL8075', 'country': {'name': 'South Korea', 'alpha2': 'KR', 'alpha3': 'KOR'}}
, 'airline': {'name': 'Korean Air', 'code': {'iata': 'KE', 'icao': 'KAL'}}
, 'airport': {'origin': {'name': 'London Heathrow Airport', 'code': {'iata': 'LHR', 'icao': 'EGLL'}, 'position': {'latitude': 51.471626, 'longitude': -0.467081, 'country': {'name': 'United Kingdom', 'code': 'GB'}, 'region': {'city': 'London'}}
, 'timezone': {'name': 'Europe/London', 'offset': 0, 'abbr': 'GMT', 'abbrName': 'Greenwich Mean Time', 'isDst': False}}, 'destination': {'name': 'Paris Charles de Gaulle Airport', 'code': {'iata': 'CDG', 'icao': 'LFPG'}, 'position': {'latitude': 49.012516, 'longitude': 2.555752, 'country': {'name': 'France', 'code': 'FR'}, 'region': {'city': 'Paris'}}, 'timezone': {'name': 'Europe/Paris', 'offset': 3600, 'abbr': 'CET', 'abbrName': 'Central European Time', 'isDst': False}}, 'real': 'None'}
, 'time': {'scheduled': {'departure_millis': 1604607300000, 'departure_date': '20201105', 'departure_time': '2115', 'departure': 1604607300, 'arrival_millis': 1604612700000, 'arrival_date': '20201105', 'arrival_time': '2245', 'arrival': 1604612700}, 'real': {'departure_millis': 1604609079000, 'departure_date': '20201105', 'departure_time': '2144', 'departure': 1604609079, 'arrival_millis': 1604611778000, 'arrival_date': '20201105', 'arrival_time': '2229', 'arrival': 1604611778}, 'estimated': {'departure': 'None', 'arrival': 'None'}, 'other': {'eta_millis': 1604611778000, 'eta_date': '20201105', 'eta_time': '2229', 'eta': 1604611778}}}
This dictionary is a huge, multi-nested, json mess and I am struggling to find a way to make it readable. I guess something like this:
identification number default KE504
alternative None
callsign KAL504
codeshare None
status live False
text Landed 22:29
Estimated None
ambiguous False
...
I am trying to turn it into a pandas DataFrame, with mixed results.
In this post it was explained that MultiIndex values have to be tuples, not dictionaries, so I used their example to convert my dictionary:
> flightlog_tuple = {(outerKey, innerKey): values for outerKey, innerDict in flightlog.items() for innerKey, values in innerDict.items()}
Which worked, up to a certain point.
df2 = pd.Series(flightlog_tuple)
gives the following output:
identification number {'default': 'KE504', 'alternative': 'None'}
callsign KAL504
codeshare None
status live False
text Landed 22:29
estimated None
ambiguous False
generic {'status': {'text': 'landed', 'type': 'arrival...
aircraft model {'code': 'B77L', 'text': 'Boeing 777-FEZ'}
registration HL8075
country {'name': 'South Korea', 'alpha2': 'KR', 'alpha...
airline name Korean Air
code {'iata': 'KE', 'icao': 'KAL'}
airport origin {'name': 'London Heathrow Airport', 'code': {'...
destination {'name': 'Paris Charles de Gaulle Airport', 'c...
real None
time scheduled {'departure_millis': 1604607300000, 'departure...
real {'departure_millis': 1604609079000, 'departure...
estimated {'departure': 'None', 'arrival': 'None'}
other {'eta_millis': 1604611778000, 'eta_date': '202...
dtype: object
Kind of what I was going for but some of the indexes are still in the column with values because there are so many levels. So I followed this explanation and tried to add more levels:
level_up = {(level1Key, level2Key, level3Key): values for level1Key, level2Dict in flightlog.items() for level2Key, level3Dict in level2Dict.items() for level3Key, values in level3Dict.items()}
df2 = pd.Series(level_up)
This code gives me AttributeError: 'str' object has no attribute 'items'. I don't understand why the first 2 indexes worked, but the others give an error.
I've tried other methods like MultiIndex.from_tuple or DataFrame.from_dict, but I can't get it to work.
This Dictionary is too complex as a beginner. I don't know what the right approach is. Maybe I am using DataFrames in the wrong way. Maybe there is an easier way to access the data that I am overlooking.
Any help would be much appreciated!

convert api response to pandas

I'd like to convert API response into a pandas dataframe to make it easier to manipulate.
Below it's what I've tried so far:
import requests
import pandas as pd
URL = 'https://api.gleif.org/api/v1/lei-records?page[size]=10&page[number]=1&filter[entity.names]=*'
r = requests.get(URL, proxies=proxyDict)
x = r.json()
x
out:
{'meta': {'goldenCopy': {'publishDate': '2020-07-14T00:00:00Z'},
'pagination': {'currentPage': 1,
'perPage': 10,
'from': 1,
'to': 10,
'total': 1675786,
'lastPage': 167579}},
'links': {'first': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=1&page%5Bsize%5D=10',
'next': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=2&page%5Bsize%5D=10',
'last': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=167579&page%5Bsize%5D=10'},
'data': [{'type': 'lei-records',
'id': '254900RR9EUYHB7PI211',
'attributes': {'lei': '254900RR9EUYHB7PI211',
'entity': {'legalName': {'name': 'MedicLights Research Inc.',
'language': None},
'otherNames': [],
'transliteratedOtherNames': [],
'legalAddress': {'language': None,
'addressLines': ['300 Ranee Avenue'],
'addressNumber': None,
'addressNumberWithinBuilding': None,
'mailRouting': None,
'city': 'Toronto',
'region': 'CA-ON',
'country': 'CA',
'postalCode': 'M6A 1N8'},
'headquartersAddress': {'language': None,
'addressLines': ['76 Marble Arch Crescent'],
'addressNumber': None,
'addressNumberWithinBuilding': None,
'mailRouting': None,
'city': 'Toronto',
'region': 'CA-ON',
'country': 'CA',
'postalCode': 'M1R 1W9'},
'registeredAt': {'id': 'RA000079', 'other': None},
'registeredAs': '002185472',
'jurisdiction': 'CA-ON',
'category': None,
'legalForm': {'id': 'O90R', 'other': None},
'associatedEntity': {'lei': None, 'name': None},
'status': 'ACTIVE',
'expiration': {'date': None, 'reason': None},
'successorEntity': {'lei': None, 'name': None},
'otherAddresses': []},
'registration': {'initialRegistrationDate': '2020-07-13T21:09:50Z',
'lastUpdateDate': '2020-07-13T21:09:50Z',
'status': 'ISSUED',
'nextRenewalDate': '2021-07-13T21:09:50Z',
'managingLou': '5493001KJTIIGC8Y1R12',
'corroborationLevel': 'PARTIALLY_CORROBORATED',
'validatedAt': {'id': 'RA000079', 'other': None},
'validatedAs': '002185472'},
'bic': None},
'relationships': {'managing-lou': {'links': {'related': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/managing-lou'}},
'lei-issuer': {'links': {'related': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/lei-issuer'}},
'direct-parent': {'links': {'reporting-exception': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/direct-parent-reporting-exception'}},
'ultimate-parent': {'links': {'reporting-exception': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/ultimate-parent-reporting-exception'}}},
'links': {'self': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211'}},
{'type': 'lei-records',
'id': '254900F9XV2K6IR5TO93',
Then I tried to put it into pandas and gives me the following results
f = pd.DataFrame(x['data'])
f
type id attributes relationships links
0 lei-records 254900RR9EUYHB7PI211 {'lei': '254900RR9EUYHB7PI211', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
1 lei-records 254900F9XV2K6IR5TO93 {'lei': '254900F9XV2K6IR5TO93', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
2 lei-records 254900DIC0729LEXNL12 {'lei': '254900DIC0729LEXNL12', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
Which isn't the result expected. I even tried to read_json with below codes:
g = pd.read_json(x.text)
g
which gives me the error
AttributeError: 'dict' object has no attribute 'text'
the expected output should look like this:
lei entity.legalName.name entity.legalAddress.addressLines entity.legalAddress.city entity.legalAddress.postalcode status registration.status
254900RR9EUYHB7PI211 MedicLights Research Inc. 300 Ranee Avenue Toronto M6A 1N8 ACTIVE ISSUED
Thanks for anyone helping
Use json_normalize like:
pd.json_normalize(x['data'])
Here is another method to use the pandas to normalize the json file using pandas.io.json.json_normalize from pandas.io.json library.
How to normalize json correctly by Python Pandas

Nesting/grouping a range of columns when converting a Pandas DataFrame to a dictionary

I've been trying to work out how to convert a Pandas DataFrame into a list of nested dictionaries and I haven't been having any luck.
My first thought was to convert the DataFrame into a list of dictionaries (with users = users.to_dict(orient='records')) and then merge the 'address' and 'color_preference' items into sublists but there must be a better way to do it!
I have a dataframe like this:
import pandas as pd
users = pd.DataFrame({'email_address': ["email#email.com"], 'status': ["active"], 'address': ["1 Eagle St"], 'suburb': ["BROOKLYN"], 'state': ["NY"], 'postcode': ["11201"], 'country': ["USA"], 'red': [False], 'orange': [True], 'yellow': [True], 'green': [True], 'blue': [False], 'indigo': [False], 'violet': [False]})
and I'm trying to convert it into this format:
{
"email_address":"email#email.com",
"status":"active",
"address":{
"address":"1 Eagle St",
"suburb":"Brooklyn",
"state":"NY",
"postcode":"11201",
"country":"USA"
},
"color_preference":{
"red":false,
"orange":true,
"yellow":true,
"green":true,
"blue":false,
"indigo":false,
"violet":false
}
}
You can do this explicitly with apply (I've done the first couple but you could do all the address/colors):
def extract_json(row):
return {
"email_address": row.loc["email_address"],
"status": row.loc["status"],
"address": row.loc[["address", "suburb"]].to_dict(),
"color_preference": row.loc[["red", "orange"]].to_dict()
}
In [11]: users.apply(extract_json, axis=1)
Out[11]:
0 {'email_address': 'email#email.com', 'status':...
dtype: object
In [12]: users.apply(extract_json, axis=1).tolist()
Out[12]:
[{'email_address': 'email#email.com',
'status': 'active',
'address': {'address': '1 Eagle St', 'suburb': 'BROOKLYN'},
'color_preference': {'red': False, 'orange': True}}]
You could pull out all the address/colors by position:
In [21]: users.columns[2:7]
Out[21]: Index(['address', 'suburb', 'state', 'postcode', 'country'], dtype='object')
In [22]: users.columns[7:]
Out[22]: Index(['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet'], dtype='object')