Reading a specific JSON element in a variable-length group - json

I am trying to read a specific tag from a JSON file in Python. The data comes from an API, and if the element positions were fixed I would have no problem. However, the elements sometimes shift around, so I can't rely on the sequence number and have to locate the element by name instead. The name should stay consistent.
Here are the two variants I have seen so far, though I am sure there could be more. So instead of relying on
heroID = data[count]['player'][0]['data'][8]['number']
to extract the value, I would much rather look up the entry whose id is "HeroId" and read its value into a variable.
longer one
[
{'id': 'HeroBattleTag', 'string': 'TFYoDa#1456'},
{'id': 'GameAccount', 'number': 10519139},
{'id': 'HeroClass', 'string': 'monk'},
{'id': 'HeroGender', 'string': 'f'},
{'id': 'HeroLevel', 'number': 70},
{'id': 'ParagonLevel', 'number': 1212},
{'id': 'HeroClanTag', 'string': 'Sc'},
{'id': 'ClanName', 'string': 'Super CasuaI'},
{'id': 'HeroId', 'number': 95443875}
]
shorter one
[
{'id': 'HeroBattleTag', 'string': 'Michael#1920'},
{'id': 'GameAccount', 'number': 96532923},
{'id': 'HeroClass', 'string': 'monk'},
{'id': 'HeroGender', 'string': 'f'},
{'id': 'HeroLevel', 'number': 70},
{'id': 'ParagonLevel', 'number': 1062},
{'id': 'HeroId', 'number': 95441675}
]

I think my question was flawed from the beginning, but I was able to solve it by iterating over the child JSON records while already iterating over the parent ones.
import pprint

for entry in data:
    rank = entry['order']  # ladderRank
    accountId = entry['player'][0]['accountId']  # accountID
    # Search the variable-length list by name instead of by position
    for k in entry['player'][0]['data']:
        if k['id'] == 'HeroId':
            heroID = k['number']
            pprint.pprint(heroID)
            break
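The lookup by name can also be wrapped in a small helper. This is only a sketch: the function name field_value is hypothetical, and it assumes every entry keeps its value under either 'number' or 'string', as in the two samples above.

```python
def field_value(fields, field_id, default=None):
    """Return the value of the entry whose 'id' equals field_id."""
    for field in fields:
        if field.get('id') == field_id:
            # Each entry stores its value under 'number' or 'string'
            return field.get('number', field.get('string', default))
    return default


# The shorter sample from the question
short_record = [
    {'id': 'HeroBattleTag', 'string': 'Michael#1920'},
    {'id': 'GameAccount', 'number': 96532923},
    {'id': 'HeroClass', 'string': 'monk'},
    {'id': 'HeroGender', 'string': 'f'},
    {'id': 'HeroLevel', 'number': 70},
    {'id': 'ParagonLevel', 'number': 1062},
    {'id': 'HeroId', 'number': 95441675},
]

hero_id = field_value(short_record, 'HeroId')
clan_name = field_value(short_record, 'ClanName')  # absent here, so None
print(hero_id, clan_name)
```

Because the helper falls back to a default, a record that lacks an optional entry such as ClanName does not raise an error.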

Related

How to create MultiIndex Dataframe from a nested dictionary (many levels)

I am using the pyflightdata library to search for flight stats. It returns json inside a list of dicts.
Here is an example of the first dictionary in the list after my query:
> flightlog = {'identification': {'number': {'default': 'KE504', 'alternative': 'None'}, 'callsign': 'KAL504', 'codeshare': 'None'}
, 'status': {'live': False, 'text': 'Landed 22:29', 'estimated': 'None', 'ambiguous': False, 'generic': {'status': {'text': 'landed', 'type': 'arrival', 'color': 'green', 'diverted': 'None'}
, 'eventTime': {'utc_millis': 1604611778000, 'utc_date': '20201105', 'utc_time': '2229', 'utc': 1604611778, 'local_millis': 1604615378000, 'local_date': '20201105', 'local_time': '2329', 'local': 1604615378}}}
, 'aircraft': {'model': {'code': 'B77L', 'text': 'Boeing 777-FEZ'}, 'registration': 'HL8075', 'country': {'name': 'South Korea', 'alpha2': 'KR', 'alpha3': 'KOR'}}
, 'airline': {'name': 'Korean Air', 'code': {'iata': 'KE', 'icao': 'KAL'}}
, 'airport': {'origin': {'name': 'London Heathrow Airport', 'code': {'iata': 'LHR', 'icao': 'EGLL'}, 'position': {'latitude': 51.471626, 'longitude': -0.467081, 'country': {'name': 'United Kingdom', 'code': 'GB'}, 'region': {'city': 'London'}}
, 'timezone': {'name': 'Europe/London', 'offset': 0, 'abbr': 'GMT', 'abbrName': 'Greenwich Mean Time', 'isDst': False}}, 'destination': {'name': 'Paris Charles de Gaulle Airport', 'code': {'iata': 'CDG', 'icao': 'LFPG'}, 'position': {'latitude': 49.012516, 'longitude': 2.555752, 'country': {'name': 'France', 'code': 'FR'}, 'region': {'city': 'Paris'}}, 'timezone': {'name': 'Europe/Paris', 'offset': 3600, 'abbr': 'CET', 'abbrName': 'Central European Time', 'isDst': False}}, 'real': 'None'}
, 'time': {'scheduled': {'departure_millis': 1604607300000, 'departure_date': '20201105', 'departure_time': '2115', 'departure': 1604607300, 'arrival_millis': 1604612700000, 'arrival_date': '20201105', 'arrival_time': '2245', 'arrival': 1604612700}, 'real': {'departure_millis': 1604609079000, 'departure_date': '20201105', 'departure_time': '2144', 'departure': 1604609079, 'arrival_millis': 1604611778000, 'arrival_date': '20201105', 'arrival_time': '2229', 'arrival': 1604611778}, 'estimated': {'departure': 'None', 'arrival': 'None'}, 'other': {'eta_millis': 1604611778000, 'eta_date': '20201105', 'eta_time': '2229', 'eta': 1604611778}}}
This dictionary is a huge, multi-nested, json mess and I am struggling to find a way to make it readable. I guess something like this:
identification  number     default      KE504
                           alternative  None
                callsign                KAL504
                codeshare               None
status          live                    False
                text                    Landed 22:29
                Estimated               None
                ambiguous               False
...
I am trying to turn it into a pandas DataFrame, with mixed results.
In this post it was explained that MultiIndex values have to be tuples, not dictionaries, so I used their example to convert my dictionary:
> flightlog_tuple = {(outerKey, innerKey): values for outerKey, innerDict in flightlog.items() for innerKey, values in innerDict.items()}
Which worked, up to a certain point.
df2 = pd.Series(flightlog_tuple)
gives the following output:
identification  number         {'default': 'KE504', 'alternative': 'None'}
                callsign       KAL504
                codeshare      None
status          live           False
                text           Landed 22:29
                estimated      None
                ambiguous      False
                generic        {'status': {'text': 'landed', 'type': 'arrival...
aircraft        model          {'code': 'B77L', 'text': 'Boeing 777-FEZ'}
                registration   HL8075
                country        {'name': 'South Korea', 'alpha2': 'KR', 'alpha...
airline         name           Korean Air
                code           {'iata': 'KE', 'icao': 'KAL'}
airport         origin         {'name': 'London Heathrow Airport', 'code': {'...
                destination    {'name': 'Paris Charles de Gaulle Airport', 'c...
                real           None
time            scheduled      {'departure_millis': 1604607300000, 'departure...
                real           {'departure_millis': 1604609079000, 'departure...
                estimated      {'departure': 'None', 'arrival': 'None'}
                other          {'eta_millis': 1604611778000, 'eta_date': '202...
dtype: object
Kind of what I was going for but some of the indexes are still in the column with values because there are so many levels. So I followed this explanation and tried to add more levels:
level_up = {(level1Key, level2Key, level3Key): values for level1Key, level2Dict in flightlog.items() for level2Key, level3Dict in level2Dict.items() for level3Key, values in level3Dict.items()}
df2 = pd.Series(level_up)
This code gives me AttributeError: 'str' object has no attribute 'items'. I don't understand why the first 2 indexes worked, but the others give an error.
I've tried other methods like MultiIndex.from_tuple or DataFrame.from_dict, but I can't get it to work.
This dictionary is too complex for me as a beginner. I don't know what the right approach is. Maybe I am using DataFrames in the wrong way. Maybe there is an easier way to access the data that I am overlooking.
Any help would be much appreciated!
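A note on the AttributeError: the three-level comprehension assumes every second-level value is itself a dict, but some are plain values (for example 'callsign' is the string 'KAL504'), so calling .items() on them fails. One possible way around this is to walk the dictionary recursively and stop wherever a non-dict value is reached. The following is a minimal sketch on a trimmed copy of the data; the helper name flatten and the choice to pad shorter key paths with empty strings are assumptions made for illustration.

```python
import pandas as pd


def flatten(d, parent=()):
    """Yield (key_path_tuple, value) for every non-dict leaf of a nested dict."""
    for key, value in d.items():
        path = parent + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# Trimmed stand-in for the flightlog dictionary in the question
sample = {
    'identification': {'number': {'default': 'KE504', 'alternative': 'None'},
                       'callsign': 'KAL504'},
    'status': {'live': False, 'text': 'Landed 22:29'},
}

leaves = dict(flatten(sample))
depth = max(len(path) for path in leaves)

# Pad shorter paths with '' so that every index tuple has the same length
padded = {path + ('',) * (depth - len(path)): value
          for path, value in leaves.items()}

index = pd.MultiIndex.from_tuples(list(padded.keys()))
series = pd.Series(list(padded.values()), index=index)
print(series)
```

The recursion handles any nesting depth, so no separate comprehension is needed per level.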

convert api response to pandas

I'd like to convert an API response into a pandas dataframe to make it easier to manipulate.
Below is what I've tried so far:
import requests
import pandas as pd
URL = 'https://api.gleif.org/api/v1/lei-records?page[size]=10&page[number]=1&filter[entity.names]=*'
r = requests.get(URL, proxies=proxyDict)
x = r.json()
x
out:
{'meta': {'goldenCopy': {'publishDate': '2020-07-14T00:00:00Z'},
'pagination': {'currentPage': 1,
'perPage': 10,
'from': 1,
'to': 10,
'total': 1675786,
'lastPage': 167579}},
'links': {'first': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=1&page%5Bsize%5D=10',
'next': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=2&page%5Bsize%5D=10',
'last': 'https://api.gleif.org/api/v1/lei-records?filter%5Bentity.names%5D=%2A&page%5Bnumber%5D=167579&page%5Bsize%5D=10'},
'data': [{'type': 'lei-records',
'id': '254900RR9EUYHB7PI211',
'attributes': {'lei': '254900RR9EUYHB7PI211',
'entity': {'legalName': {'name': 'MedicLights Research Inc.',
'language': None},
'otherNames': [],
'transliteratedOtherNames': [],
'legalAddress': {'language': None,
'addressLines': ['300 Ranee Avenue'],
'addressNumber': None,
'addressNumberWithinBuilding': None,
'mailRouting': None,
'city': 'Toronto',
'region': 'CA-ON',
'country': 'CA',
'postalCode': 'M6A 1N8'},
'headquartersAddress': {'language': None,
'addressLines': ['76 Marble Arch Crescent'],
'addressNumber': None,
'addressNumberWithinBuilding': None,
'mailRouting': None,
'city': 'Toronto',
'region': 'CA-ON',
'country': 'CA',
'postalCode': 'M1R 1W9'},
'registeredAt': {'id': 'RA000079', 'other': None},
'registeredAs': '002185472',
'jurisdiction': 'CA-ON',
'category': None,
'legalForm': {'id': 'O90R', 'other': None},
'associatedEntity': {'lei': None, 'name': None},
'status': 'ACTIVE',
'expiration': {'date': None, 'reason': None},
'successorEntity': {'lei': None, 'name': None},
'otherAddresses': []},
'registration': {'initialRegistrationDate': '2020-07-13T21:09:50Z',
'lastUpdateDate': '2020-07-13T21:09:50Z',
'status': 'ISSUED',
'nextRenewalDate': '2021-07-13T21:09:50Z',
'managingLou': '5493001KJTIIGC8Y1R12',
'corroborationLevel': 'PARTIALLY_CORROBORATED',
'validatedAt': {'id': 'RA000079', 'other': None},
'validatedAs': '002185472'},
'bic': None},
'relationships': {'managing-lou': {'links': {'related': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/managing-lou'}},
'lei-issuer': {'links': {'related': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/lei-issuer'}},
'direct-parent': {'links': {'reporting-exception': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/direct-parent-reporting-exception'}},
'ultimate-parent': {'links': {'reporting-exception': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211/ultimate-parent-reporting-exception'}}},
'links': {'self': 'https://api.gleif.org/api/v1/lei-records/254900RR9EUYHB7PI211'}},
{'type': 'lei-records',
'id': '254900F9XV2K6IR5TO93',
Then I tried to put it into pandas, which gives me the following result:
f = pd.DataFrame(x['data'])
f
type id attributes relationships links
0 lei-records 254900RR9EUYHB7PI211 {'lei': '254900RR9EUYHB7PI211', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
1 lei-records 254900F9XV2K6IR5TO93 {'lei': '254900F9XV2K6IR5TO93', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
2 lei-records 254900DIC0729LEXNL12 {'lei': '254900DIC0729LEXNL12', 'entity': {'le... {'managing-lou': {'links': {'related': 'https:... {'self': 'https://api.gleif.org/api/v1/lei-rec...
This isn't the result I expected. I also tried read_json with the code below:
g = pd.read_json(x.text)
g
which gives me the error
AttributeError: 'dict' object has no attribute 'text'
The expected output should look like this:
lei entity.legalName.name entity.legalAddress.addressLines entity.legalAddress.city entity.legalAddress.postalcode status registration.status
254900RR9EUYHB7PI211 MedicLights Research Inc. 300 Ranee Avenue Toronto M6A 1N8 ACTIVE ISSUED
Thanks to anyone who can help.
Use json_normalize like:
pd.json_normalize(x['data'])
Another option is json_normalize from the pandas.io.json module (pandas.io.json.json_normalize). Since pandas 1.0 that import path is deprecated in favor of the top-level pd.json_normalize.
See also: How to normalize json correctly by Python Pandas
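As a rough illustration of json_normalize on this kind of response, the sketch below uses a trimmed, hand-copied stand-in for x['data'] rather than a live API call. Passing only each record's 'attributes' dictionary is an assumption made here so that the column names match the expected output (for example entity.legalName.name) instead of carrying an attributes. prefix. Note also that r.json() already returns a dict, which is why x.text raised an AttributeError.

```python
import pandas as pd

# Trimmed stand-in for x['data'] from the question
records = [{
    'type': 'lei-records',
    'id': '254900RR9EUYHB7PI211',
    'attributes': {
        'lei': '254900RR9EUYHB7PI211',
        'entity': {
            'legalName': {'name': 'MedicLights Research Inc.', 'language': None},
            'legalAddress': {'addressLines': ['300 Ranee Avenue'],
                             'city': 'Toronto',
                             'postalCode': 'M6A 1N8'},
            'status': 'ACTIVE',
        },
        'registration': {'status': 'ISSUED'},
    },
}]

# Nested dictionaries become dot-separated column names; lists stay as lists
df = pd.json_normalize([record['attributes'] for record in records])

wanted = [
    'lei',
    'entity.legalName.name',
    'entity.legalAddress.addressLines',
    'entity.legalAddress.city',
    'entity.legalAddress.postalCode',
    'entity.status',
    'registration.status',
]
result = df[wanted]
print(result)
```

Selecting the columns afterwards keeps the normalization step generic and leaves the choice of fields in one place.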

Why does dask.bag.read_text(filename).map(json.loads) return a list?

I need to read several json.gz files using Dask. I am trying to achieve this with dask.bag.read_text(filename).map(json.loads), but the output is a nested list (the files contain lists of dictionaries), whereas I would like to get just a list of dictionaries.
I have included a small example that reproduces my problem, below.
import json
import gzip
import dask.bag as db
dict_list = [{'id': 123, 'name': 'lemurt', 'indices': [1,10]}, {'id': 345, 'name': 'katin', 'indices': [2,11]}]
filename = './test.json.gz'
# Write json
with gzip.open(filename, 'wt') as write_file:
    json.dump(dict_list, write_file)
# Read json
with gzip.open(filename, "r") as read_file:
    data = json.load(read_file)
# Read json with Dask
data_dask = db.read_text(filename).map(json.loads).compute()
print(data)
print(data_dask)
I would like to get the first output:
[{'id': 123, 'name': 'lemurt', 'indices': [1, 10]}, {'id': 345, 'name': 'katin', 'indices': [2, 11]}]
But instead I get the second one:
[[{'id': 123, 'name': 'lemurt', 'indices': [1, 10]}, {'id': 345, 'name': 'katin', 'indices': [2, 11]}]]
The read_text function returns a bag in which each element is a line of text, so you start with a list of strings. You then parse each line with json.loads, and since each line holds a JSON list, each element becomes a list. The result is a list of lists.
In your case you might use map_partitions with a function that expects a partition consisting of a single line of text:
b = db.read_text("*.json.gz").map_partitions(lambda L: json.loads(L[0]))
Following the comment by @MRocklin, I ended up solving my problem by changing the way I was writing the json.gz files.
Instead of
with gzip.open(filename, 'wt') as write_file:
    json.dump(dict_list, write_file)
I used
with gzip.open(filename, 'wt') as write_file:
    for dd in dict_list:
        json.dump(dd, write_file)
        write_file.write("\n")
and kept reading the files as
db.read_text(filename).map(json.loads)
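The reason the one-record-per-line layout works can be seen without Dask. The sketch below writes the same records as newline-delimited JSON into a temporary file (the temporary path is an assumption for the example) and parses each line separately, which mirrors what read_text followed by map(json.loads) does line by line.

```python
import gzip
import json
import os
import tempfile

dict_list = [
    {'id': 123, 'name': 'lemurt', 'indices': [1, 10]},
    {'id': 345, 'name': 'katin', 'indices': [2, 11]},
]

# Temporary location used only for this example
path = os.path.join(tempfile.mkdtemp(), 'test.json.gz')

# Write one JSON document per line
with gzip.open(path, 'wt') as write_file:
    for record in dict_list:
        write_file.write(json.dumps(record) + '\n')

# Parse every line on its own, so each element is a dict rather than a list
with gzip.open(path, 'rt') as read_file:
    records = [json.loads(line) for line in read_file]

print(records)
```

Each line now decodes to a single dictionary, so the collected result is a flat list of dictionaries.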

How can I iterate over a JSON structure?

I am trying to create a list that collects all the ids from a JSON file.
To get one id I did: list = dict['files']['file'][0]['id']. I was wondering if I can loop over it to get all of them.
The JSON object is:
{'files':{'page': 1, 'pages': 123, 'perpage': 2, 'file': [{'id': '123', 'name': 'John'}, {'id': '234', 'name': 'Lee'}, {'id': '345', 'name': 'Josh'}, {'id': '456', 'name': 'mi'...}
The value at dic['files']['file'] is a list of dictionaries, so in Python 3 you can iterate over it directly:
for item in dic['files']['file']:
    print(item['id'])
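To collect the ids into a list rather than print them, a list comprehension is one option. This sketch uses only the first three complete entries from the question, and the variable names response and ids are chosen for the example (which also avoids shadowing the built-ins dict and list).

```python
# Trimmed copy of the structure from the question
response = {
    'files': {
        'page': 1,
        'pages': 123,
        'perpage': 2,
        'file': [
            {'id': '123', 'name': 'John'},
            {'id': '234', 'name': 'Lee'},
            {'id': '345', 'name': 'Josh'},
        ],
    }
}

# 'file' holds a list of dicts, so iterate over the list and pick each 'id'
ids = [item['id'] for item in response['files']['file']]
print(ids)
```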

Python: Converting dictionaries into pandas dataframe

I've got some data out of the Pocket API and the resulting JSON called list has some nested JSON within it. Sample below
{'complete': 1,
'error': None,
'list': {'1992211110': {'authors': {'8683682': {'author_id': '8683682',
'item_id': '1992211110',
'name': 'Robert Kuttner',
'url': 'http://www.nybooks.com/contributors/robert-kuttner/'}},
'excerpt': 'What a splendid era this was going to be, with one remaining superpower spreading capitalism and liberal democracy around the world. Instead, democracy and capitalism seem increasingly incompatible.',
'favorite': '0',
'given_title': '',
'given_url': 'http://nyrevinc.cmail20.com/t/y-l-klpdut-jduhlyklkl-d/',
'has_image': '0',
'has_video': '0',
'is_article': '1',
'is_index': '0',
'item_id': '1992211110',
'resolved_id': '1977788178',
'resolved_title': 'The Man from Red Vienna',
'resolved_url': 'http://www.nybooks.com/articles/2017/12/21/karl-polanyi-man-from-red-vienna/',
'sort_id': 6,
'status': '0',
'time_added': '1520132694',
'time_favorited': '0',
'time_read': '0',
'time_updated': '1520140351',
'word_count': '4009'},
I've managed to get the whole result into a dataframe. However, there is some nesting: what looks like a dictionary called authors. I've managed to split this out into dictionaries keyed by an index, but I can't figure out how to get that into a dataframe. A sample of the authors is below:
{1: {'authors': {'8683682': {'author_id': '8683682',
'item_id': '1992211110',
'name': 'Robert Kuttner',
'url': 'http://www.nybooks.com/contributors/robert-kuttner/'}}},
2: {'authors': {'53525958': {'author_id': '53525958',
'item_id': '2086463428',
'name': 'Adam Tooze',
'url': 'http://www.nybooks.com/contributors/adam-tooze/'}}},
3: {'authors': {'3490600': {'author_id': '3490600',
'item_id': '2090266893',
'name': 'Adam Liaw',
'url': ''}}},
4: {'authors': {'75929933': {'author_id': '75929933',
'item_id': '2091894678',
'name': 'umair haque',
'url': 'https://eand.co/#umairh'}}},
5: {'authors': {'61177521': {'author_id': '61177521',
'item_id': '2092663780',
'name': 'Annalisa Merelli',
'url': 'https://qz.com/author/amerelliqz/'}}},
6: {'authors': {'52268529': {'author_id': '52268529',
'item_id': '2092922221',
'name': 'Aditya Chakrabortty',
'url': 'https://www.theguardian.com/profile/adityachakrabortty'}}},
7: {'authors': {'28083': {'author_id': '28083',
'item_id': '2096294305',
'name': 'Alana Semuels',
'url': ''}}},
8: {'authors': {'185472': {'author_id': '185472',
'item_id': '2097100251',
'name': 'TIM KREIDER',
'url': ''}}},
9: {'authors': {'2771923': {'author_id': '2771923',
'item_id': '2098788948',
'name': 'Richard Bernstein',
'url': 'http://www.nybooks.com/contributors/richard-bernstein/'}}},
10: {'authors': {'61111044': {'author_id': '61111044',
'item_id': '2102383890',
'name': 'Ephrat Livni',
'url': 'https://qz.com/author/livniqz/'}}}}
Any help much appreciated, I am very new to python and pandas.
Here is a proposal. You need to flatten your secondary dictionary before loading it into a dataframe.
Here, input is your second dictionary (note that this name shadows Python's built-in input function).
authors_filtered = [v for v in zip(*[dict(item).values() for item in [input[i]['authors'] for i in input]])][0]
output = pd.DataFrame.from_dict(list(authors_filtered))
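A possibly more readable variant collects every inner author dictionary into a list of rows and passes that to the DataFrame constructor. This is a sketch on the first two entries of the sample; the names authors_by_index and rows are assumptions. Unlike the one-liner above, it keeps all authors when an item has more than one.

```python
import pandas as pd

# First two entries of the authors sample from the question
authors_by_index = {
    1: {'authors': {'8683682': {
        'author_id': '8683682',
        'item_id': '1992211110',
        'name': 'Robert Kuttner',
        'url': 'http://www.nybooks.com/contributors/robert-kuttner/'}}},
    2: {'authors': {'53525958': {
        'author_id': '53525958',
        'item_id': '2086463428',
        'name': 'Adam Tooze',
        'url': 'http://www.nybooks.com/contributors/adam-tooze/'}}},
}

# One row per author: walk each entry, then each author inside it
rows = [
    author
    for entry in authors_by_index.values()
    for author in entry['authors'].values()
]

authors_df = pd.DataFrame(rows)
print(authors_df)
```

Since each row keeps its item_id, the resulting frame can later be joined back to the main dataframe on that column.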