Recursively Slicing JSON into Dataframe Columns - json

I have a dataframe with a column containing JSON, where one record looks like this:
player_feedback
{'player': '1b87a117-09ef-41e2-8710-6bc144760a74',
'feedback': [{'answer': [{'id': '1-6gaincareerinfo', 'content': 'To gain career information'},
{'id': '1-5proveskills', 'content': 'Opportunity to prove skills by competing '},
{'id': '1-1diff', 'content': 'Try something different'}], 'question': 1},
{'answer': [{'id': '2-2skilldev', 'content': 'Skill development'}], 'question': 2},
{'answer': [{'id': '3-6exploit', 'content': 'Exploitation'},
{'id': '3-1forensics', 'content': 'Forensics'}], 'question': 3},
{'answer': 'verygood', 'question': 4},
{'answer': 'poor', 'question': 5}, ... ... ,
{'answer': 'verygood', 'question': 15}]}
Here are the first 5 rows of the data.
I want to convert this column to separate columns like this:
player | Question 1 | Question 2 | ... | Question 15
1b87a117-09ef-41e2-8710-6bc144760a74 | To gain career information, Opportunity to prove skills by competing, Try something different | Skill development | ... | verygood
I started with:
df_survey_responses['player_feedback'].apply(ast.literal_eval).values.tolist()
but that only gets me the player id in a separate field and the feedback in another. As far as I can tell, json_normalize would give me a similar result. How can I do this recursively to get my desired result, or is there a better way to do this?
Thanks!

You can use a JSON flattener like this one:
def flatten_json(nested_json):
    """
    Flatten json object with nested keys into a single level.

    Args:
        nested_json: A nested json object.
    Returns:
        The flattened json object if successful, None otherwise.
    """
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            # Recurse into each key, extending the column name prefix
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # Recurse into each element, using its index in the prefix
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Leaf value: drop the trailing underscore from the name
            out[name[:-1]] = x

    flatten(nested_json)
    return out
Which gives dataframes that look like this:
0
player 34a8eb8a-056f-4568-88dc-8736056819a3
feedback_0_answer_0_id 1-5proveskills
feedback_0_answer_0_content Opportunity to prove skills by competing
feedback_0_question 1
feedback_1_answer_0_id 2-1networking
feedback_1_answer_0_content Networking
feedback_1_answer_1_id 2-2skilldev
feedback_1_answer_1_content Skill development
feedback_1_question 2
feedback_2_answer_0_id 3-5boottoroot
feedback_2_answer_0_content Boot2root
feedback_2_answer_1_id 3-6exploit
feedback_2_answer_1_content Exploitation
feedback_2_question 3
feedback_3_answer good
feedback_3_question 4
feedback_4_answer good
feedback_4_question 5
feedback_5_answer selfchose
feedback_5_question 6
feedback_6_answer pairs
feedback_6_question 7
feedback_7_answer_0_id 7-persistence
feedback_7_answer_0_content Persistence
feedback_7_question 8
feedback_8_answer social
feedback_8_question 9
feedback_9_answer training
feedback_9_question 10
feedback_10_answer yes
feedback_10_question 11
feedback_11_answer yes
feedback_11_question 12
feedback_12_answer yes
feedback_12_question 13
feedback_13_answer yes
feedback_13_question 14
feedback_14_answer verygood
feedback_14_question 15
feedback_15_answer yes
feedback_15_question 16
feedback_16_answer yes
feedback_16_question 17
feedback_17_answer It would be good to have more exploitation one...
feedback_17_question 18
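The flattened output above has one column per leaf value, which is not quite the "one column per question" layout the question asks for. As a rough sketch (not part of the original answer), each record could instead be reduced to one row keyed by question number, joining multi-select answers with commas. The helper name feedback_to_row and the shortened sample record are assumptions made for illustration.

```python
import ast

def feedback_to_row(record):
    """Turn one parsed player_feedback dict into a flat {column: value} row."""
    row = {"player": record["player"]}
    for item in record["feedback"]:
        answer = item["answer"]
        if isinstance(answer, list):
            # Multi-select answers: keep only the readable text, comma-joined
            answer = ", ".join(a["content"].strip() for a in answer)
        row[f"Question {item['question']}"] = answer
    return row

# Shortened sample in the same shape as the record in the question
raw = (
    "{'player': '1b87a117-09ef-41e2-8710-6bc144760a74', "
    "'feedback': [{'answer': [{'id': '1-6gaincareerinfo', 'content': 'To gain career information'}, "
    "{'id': '1-1diff', 'content': 'Try something different'}], 'question': 1}, "
    "{'answer': [{'id': '2-2skilldev', 'content': 'Skill development'}], 'question': 2}, "
    "{'answer': 'verygood', 'question': 4}]}"
)

# The column holds Python-literal strings, so ast.literal_eval parses them
rows = [feedback_to_row(ast.literal_eval(raw))]
```

Assuming pandas is available, pd.DataFrame(rows) should then give one row per player with columns player, Question 1, Question 2 and so on.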

Related

converting json into pandas dataframe

I have JSON output that I would like to convert to a pandas dataframe. I downloaded it from a website over HTTPS using an API key. Here is what I coded:
json_data = vehicle_miles_traveled.json()
print(json_data)
{'request': {'command': 'series', 'series_id': 'STEO.MVVMPUS.A'}, 'series': [{'series_id': 'STEO.MVVMPUS.A', 'name': 'Vehicle Miles Traveled, Annual', 'units': 'million miles/day', 'f': 'A', 'description': 'Includes gasoline and diesel fuel vehicles', 'copyright': 'None', 'source': 'U.S. Energy Information Administration (EIA) - Short Term Energy Outlook', 'geography': 'USA', 'start': '1990', 'end': '2023', 'lastHistoricalPeriod': '2021', 'updated': '2022-03-08T12:39:35-0500', 'data': [['2023', 9247.0281671], ['2022', 9092.4575671], ['2021', 8846.1232877], ['2020', 7933.3907104], ['2019', 8936.3589041], ['2018', 8877.6027397], ['2017', 8800.9479452], ['2016', 8673.2431694], ['2015', 8480.4712329], ['2014', 8289.4684932], ['2013', 8187.0712329], ['2012', 8110.8387978], ['2011', 8083.2931507], ['2010', 8129.4958904], ['2009', 8100.7205479], ['2008', 8124.3387978], ['2007', 8300.8794521], ['2006', 8257.8520548], ['2005', 8190.2136986], ['2004', 8100.5163934], ['2003', 7918.4136986], ['2002', 7823.3123288], ['2001', 7659.2054795], ['2000', 7505.2622951], ['1999', 7340.9808219], ['1998', 7192.7780822], ['1997', 7014.7205479], ['1996', 6781.9699454], ['1995', 6637.7369863], ['1994', 6459.1452055], ['1993', 6292.3424658], ['1992', 6139.7595628], ['1991', 5951.2712329], ['1990', 5883.5643836]]}]}
It depends heavily on your final goal. You could put all of the metadata into a dataframe if you wanted to. I assume that you are interested in reading the data field into a dataframe.
We can get those values by accessing:
import pandas as pd
data = json_data['series'][0]['data']
# and pass them to the DataFrame constructor. We can specify the column names as well!
df = pd.DataFrame(data, columns=['year', 'other_col_name'])
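Note that in this response the years arrive as strings and in descending order. A minimal sketch of tidying the pairs before building the dataframe is shown here; the column name miles_per_day is a made-up label, and the json_data literal is a shortened copy of the response above.

```python
# Shortened copy of the API response shown in the question
json_data = {
    "series": [{
        "series_id": "STEO.MVVMPUS.A",
        "units": "million miles/day",
        "data": [["2023", 9247.0281671], ["2022", 9092.4575671], ["2021", 8846.1232877]],
    }]
}

series = json_data["series"][0]

# Convert each [year, value] pair to a dict, cast the year to int,
# and sort ascending so the oldest year comes first
rows = sorted(
    ({"year": int(year), "miles_per_day": value} for year, value in series["data"]),
    key=lambda row: row["year"],
)
```

A list of dicts like rows can be passed straight to pd.DataFrame(rows), which takes the column names from the dict keys.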

How to read dynamic/changing keys from JSON in Python

I am using the UK Bus API to collect bus arrival times etc.
In Python 3 I have been using....
import http.client
import json
from types import SimpleNamespace as Namespace

connection = http.client.HTTPSConnection("transportapi.com")
connection.request("GET", "/v3/uk/bus/stop/xxxxxxxxx/live.json?app_id=xxxxxxxxxxxxxxx&app_key=xxxxxxxxxxxxxxxxxxxxxxxxxxx&group=route&nextbuses=yes")
res = connection.getresponse()
data = res.read()
connection.close()
# Parse every JSON object into a namespace so fields are attributes
x = json.loads(data, object_hook=lambda d: Namespace(**d))
print("Stop Name : " + x.stop_name)
Which is all reasonably simple, however the JSON data returned looks like this...
{
"atcocode":"xxxxxxxx",
"smscode":"xxxxxxxx",
"request_time":"2020-03-10T15:42:22+00:00",
"name":"Hospital",
"stop_name":"Hospital",
"bearing":"SE",
"indicator":"adj",
"locality":"Here",
"location":{
"type":"Point",
"coordinates":[
-1.xxxxx,
50.xxxxx
]
},
"departures":{
"8":[
{
"mode":"bus",
"line":"8",
"line_name":"8",
"direction":"North",
"operator":"CBLE",
"date":"2020-03-10",
Under "departures" the key name changes due to the bus number / code.
Using Python 3 how do I extract the key name and all subsequent values below/within it?
Many thanks for any help!
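With the object_hook in your code, x is a SimpleNamespace and cannot be indexed with x["departures"]. If you parse with plain json.loads(data) instead, x is a dict and you can do this:
for k, v in x["departures"].items():
    print(k, v)  # or whatever else you wanted to do
Which prints:
8 [{'mode': 'bus', 'line': '8', 'line_name': '8', 'direction': 'North', 'operator': 'CBLE', 'date': '2020-03-10'}]
So k is the key (the string '8') and v is the list of departures under it. If you want to keep the namespace objects, iterate over vars(x.departures).items() instead.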
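For reference, here is a small self-contained sketch of reading the changing route keys both ways: from a plain dict and from SimpleNamespace objects like the ones in the question. The payload is a cut-down, made-up version of the API response, and the second route "X4" is hypothetical.

```python
import json
from types import SimpleNamespace

# Cut-down stand-in for the API response; route "X4" is invented
payload = (
    '{"stop_name": "Hospital", "departures": {'
    '"8": [{"mode": "bus", "line": "8", "direction": "North"}], '
    '"X4": [{"mode": "bus", "line": "X4", "direction": "South"}]}}'
)

# Option 1: plain dicts, so .items() yields (route key, list of departures)
as_dict = json.loads(payload)
routes = {
    route: [dep["direction"] for dep in deps]
    for route, deps in as_dict["departures"].items()
}

# Option 2: SimpleNamespace objects; vars() exposes the attributes as a dict,
# which also works for names like "8" that are not valid identifiers
as_ns = json.loads(payload, object_hook=lambda d: SimpleNamespace(**d))
ns_routes = {
    route: [dep.direction for dep in deps]
    for route, deps in vars(as_ns.departures).items()
}
```

Both approaches produce the same mapping from route key to directions; the dict version is usually simpler when the key names are not known in advance.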

transform json column in a dataframe

I have a dataframe in which two columns are JSON objects. Something like this:
id choice name host
002 {'option': 'true'} Bob {'city': {'name': 'A'}}
003 {'option': 'false'} Ana {'city': {'name': 'B'}}
004 {'option': 'false'} Nic {'city': {'name': 'C'}}
I want the columns choice and host to contain only the final string (true, false, A, B, C...).
I was able to do this for the host column with the following:
df['host'] = (df.loc[:, 'host']
              .apply(lambda x: x['city']['name']))
This was successful. However, when I apply something similar to the choice column
df['choice'] = (df.loc[:, 'choice']
                .apply(lambda x: x['option']))
I get TypeError: 'NoneType' object is not subscriptable
How can I get a choice column with "true" and "false"?
Let us use str.get
df.choice.str.get('option')
0 true
1 false
2 false
Name: choice, dtype: object
df.host.str.get('city').str.get('name')
0 A
1 B
2 C
Name: host, dtype: object
First make sure the values in both columns are dict objects. If they are strings instead, convert them with ast.literal_eval:
import ast
df[['choice','host']] = df[['choice','host']].applymap(ast.literal_eval)
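The TypeError in the question suggests that at least one row holds None rather than a dict, which is why the lambda fails. A minimal sketch of a None-tolerant lookup in plain Python is shown here; the helper name dig and the sample values are assumptions for illustration.

```python
def dig(value, *keys):
    """Follow keys through nested dicts; return None if any level is missing."""
    for key in keys:
        if not isinstance(value, dict):
            # Covers None and any other non-dict value at this level
            return None
        value = value.get(key)
    return value

# Sample column values, including the kind of None that breaks x['option']
choices = [{'option': 'true'}, None, {'option': 'false'}]
hosts = [{'city': {'name': 'A'}}, {'city': None}, {}]

choice_values = [dig(c, 'option') for c in choices]
host_values = [dig(h, 'city', 'name') for h in hosts]
```

With a dataframe, the same helper could be used as df['choice'].apply(lambda x: dig(x, 'option')), leaving None where the value is missing.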

json.loads function not giving python dictionary

I am trying to convert the JSON string below to a Python dictionary, using Python 3's json package. Here is the code that I am using:
a = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]"
b = json.loads(json.dumps(a))
print(type(b))
And the output that I am getting from the above code is:
<class 'str'>
I have seen similar questions on Stack Overflow, but the solutions presented there do not apply to my case.
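The string that you are trying to convert is not valid JSON: JSON requires double quotes around keys and string values, not single quotes. Also, json.dumps(a) only encodes the Python string as a JSON string, so json.loads(json.dumps(a)) hands back the original str. You only need to call json.loads to convert a JSON string into a dict or list.
The updated code would look like: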
import json
a = '[{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}, {"id": 10751, "name": "Family"}, {"id": 10749, "name": "Romance"}]'
b = json.loads(a)
print(type(b))
Hope this explains why you are not getting the expected results.
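If the single-quoted string comes from somewhere you cannot change, it is a Python literal rather than JSON, and one option is ast.literal_eval from the standard library. The sketch below uses a shortened copy of the string from the question and shows the three behaviours side by side.

```python
import ast
import json

# Shortened copy of the single-quoted string from the question
a = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

# dumps then loads just round-trips the str, which is why type(b) was str
round_trip = json.loads(json.dumps(a))

# Strict JSON parsing rejects the single quotes
try:
    json.loads(a)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
genres = ast.literal_eval(a)
```

Here genres is a list of dicts, while round_trip is still the original string.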
A JSON array is enclosed in [ ], while a JSON object is enclosed in { }.
The string in a is a JSON array, so it can only be converted into a list.
Keys and string values must be enclosed in double quotes; that is a requirement of JSON, which Python's json library enforces.
b = json.loads(a) will give a list of dictionary objects.
To get a dictionary of dictionaries, you need to associate a key with each individual dictionary.
d = dict()
ind = 0
for data in b:
    d[ind] = data
    ind += 1
Now the output that you get will be
{0: {'id': 35, 'name': 'Comedy'}, 1: {'id': 18, 'name': 'Drama'}, 2: {'id': 10751, 'name': 'Family'}, 3: {'id': 10749, 'name': 'Romance'}}
which is a dictionary of dictionaries.
Thank you

Why does dask.bag.read_text(filename).map(json.loads) return a list?

I need to read several json.gz files using Dask. I am trying to achieve this with dask.bag.read_text(filename).map(json.loads), but the output is a nested list (the files contain lists of dictionaries), whereas I would like to get just a list of dictionaries.
I have included a small example that reproduces my problem, below.
import json
import gzip
import dask.bag as db
dict_list = [{'id': 123, 'name': 'lemurt', 'indices': [1,10]}, {'id': 345, 'name': 'katin', 'indices': [2,11]}]
filename = './test.json.gz'
# Write json
with gzip.open(filename, 'wt') as write_file:
    json.dump(dict_list, write_file)
# Read json
with gzip.open(filename, "r") as read_file:
    data = json.load(read_file)
# Read json with Dask
data_dask = db.read_text(filename).map(json.loads).compute()
print(data)
print(data_dask)
I would like to get the first output:
[{'id': 123, 'name': 'lemurt', 'indices': [1, 10]}, {'id': 345, 'name': 'katin', 'indices': [2, 11]}]
But instead I get the second one:
[[{'id': 123, 'name': 'lemurt', 'indices': [1, 10]}, {'id': 345, 'name': 'katin', 'indices': [2, 11]}]]
The read_text function returns a bag in which each element is a line of text, so you have a list of strings. You then parse each of those lines with json.loads, so each line becomes a list of dictionaries. So you have a list of lists.
In your case you might use map_partitions with a function that expects a list containing a single line of text:
b = db.read_text("*.json.gz").map_partitions(lambda L: json.loads(L[0]))
Following the comment by @MRocklin, I ended up solving my problem by changing the way I was writing the json.gz files, so that each dictionary goes on its own line.
Instead of
with gzip.open(filename, 'wt') as write_file:
    json.dump(dict_list, write_file)
I used
with gzip.open(filename, 'wt') as write_file:
    for dd in dict_list:
        json.dump(dd, write_file)
        write_file.write("\n")
and kept reading the files as
db.read_text(filename).map(json.loads)
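To see why the one-record-per-line layout works, here is a sketch using only the standard library: it writes the records as newline-delimited JSON and then parses the file line by line, which is roughly what read_text followed by map(json.loads) does per element. The temporary file path is an assumption made so the example does not touch the working directory.

```python
import gzip
import json
import os
import tempfile

dict_list = [
    {'id': 123, 'name': 'lemurt', 'indices': [1, 10]},
    {'id': 345, 'name': 'katin', 'indices': [2, 11]},
]

# Write to a throwaway location instead of ./test.json.gz
path = os.path.join(tempfile.mkdtemp(), "test.json.gz")

# One JSON document per line (newline-delimited JSON)
with gzip.open(path, "wt") as write_file:
    for record in dict_list:
        write_file.write(json.dumps(record) + "\n")

# Parsing each line separately yields a flat list of dicts, not a nested list
with gzip.open(path, "rt") as read_file:
    parsed = [json.loads(line) for line in read_file]
```

Because every line is a complete JSON object, each element parses to a single dictionary and no extra level of nesting appears.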