Transform a JSON column in a dataframe

I have a dataframe in which two columns are JSON objects. Something like this:
id   choice               name  host
002  {'option': 'true'}   Bob   {'city': {'name': 'A'}}
003  {'option': 'false'}  Ana   {'city': {'name': 'B'}}
004  {'option': 'false'}  Nic   {'city': {'name': 'C'}}
I want the choice and host columns to contain only the final string of each object (true, false, A, B, C...).
I was able to do this for the host column with the following code:
df['host'] = (df.loc[:, 'host']
              .apply(lambda x: x['city']['name']))
This was successful. However, when I apply something similar to the choice column:
df['choice'] = (df.loc[:, 'choice']
                .apply(lambda x: x['option']))
I get TypeError: 'NoneType' object is not subscriptable
How can I get a choice column containing "true" and "false"?

Let us use str.get
df.choice.str.get('option')
0 true
1 false
2 false
Name: choice, dtype: object
df.host.str.get('city').str.get('name')
0 A
1 B
2 C
Name: host, dtype: object
First, make sure the values in both columns are dict objects. If they are strings instead, convert them with ast.literal_eval:
import ast
df[['choice', 'host']] = df[['choice', 'host']].applymap(ast.literal_eval)
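The TypeError in the question means that at least one value in the choice column is None, so a plain x['option'] fails on that row. Below is a minimal sketch of a None-tolerant lookup in plain Python; the helper name dig and the sample values are assumptions made for illustration, not part of the original data.

```python
def dig(value, *keys):
    """Follow keys through nested dicts; return None if any step is missing."""
    for key in keys:
        if not isinstance(value, dict):
            # Covers None and any other non-dict value along the path.
            return None
        value = value.get(key)
    return value


# Hypothetical column values, including the None that triggers the TypeError.
choice_values = [{'option': 'true'}, None, {'option': 'false'}]
host_values = [{'city': {'name': 'A'}}, {'city': None}]

choices = [dig(v, 'option') for v in choice_values]
hosts = [dig(v, 'city', 'name') for v in host_values]

print(choices)  # ['true', None, 'false']
print(hosts)    # ['A', None]
```

With a dataframe, the same helper could be passed to apply, for example df['choice'].apply(lambda x: dig(x, 'option')).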

Related

Trying to merge two json using python dictionary

I am trying to merge two JSON objects stored in Python dictionaries. The first dictionary is the parent into which the second dictionary gets merged. In reality, the second dictionary represents a second line in a TSV file that holds the next record of a JSON array.
My program reads the TSV file line by line and merges the lines into one single nested JSON object.
Let us consider the two dictionaries:
Parent dictionary dict1: {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
and,
dict2: {"CA": {"Marin": {"zip":2}}}
Note: dict1 is the source of truth with regard to the correct JSON structure.
Here, as you can see, I would like to append the zip: 2 into the Marin county of California state.
Here is my merge code:
class MyClass:
    # merges two lines containing json arrays inside nested json
    def merge_lines(dict1: dict, dict2: dict) -> dict:
        for key in dict1:
            if key in dict2:
                if isinstance(dict1[key], dict):
                    MyClass.merge_lines(dict1[key], dict2[key])
                elif isinstance(dict1[key], list):
                    to_be_merged_list = [dict2[key]]
                    dict1[key].extend(to_be_merged_list)
        return dict1
Below is how I am trying to test:
def test_nested_json_arrays(self):
    d1 = {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
    d2 = {"CA": {"Marin": {"zip":2}}}
    expected_result = {"CA": [{"Marin": [{"zip":1}, {"zip":2}], "population":1000}]}
    actual = MyClass.merge_lines(d1, d2)
    assert expected_result == actual
However, I am getting the below result:
E AssertionError: assert {'CA': [{'Mar...tion': 1000}]} == {'CA': [{'Mari... {'zip': 2}}]}
E Differing items:
E {'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 1000}]} != {'CA': [{'Marin': [{'zip': 1}], 'population': 10000}, {'Marin': {'zip': 2}}]}
E Full diff:
E - {'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 1000}]}
E + {'CA': [{'Marin': [{'zip': 1}], 'population': 10000}, {'Marin': {'zip': 2}}]}
Can someone help me figure out the changes required in the code to fix this?
Note: the field names are not constant and this can apply to any combination of country, state, county, zip and other nested attributes.
dict1 = {"CA": [{"Marin": [{"zip":1}], "population":10000}]}
dict2 = {"CA": {"Marin": {"zip":2}}}
Looking at dict2:
You are given dict2, containing key k (in this case k = "CA")
dict2[k] itself is a dictionary, that contains one (or more) key (c = "Marin") - value (z) pair(s)
z now is the dictionary that you care about.
Looking at dict1:
For each element county_info in dict1[k], you care about the one that has a key c.
The value at this key (county_info[c]) is a list, to which you want to append z
So let's do that:
def merge_lines(dict1, dict2):
    for k, v in dict2.items():
        for c, z in v.items():
            # Find the element of dict1[k] that has the key c:
            for county_info in dict1[k]:
                if c in county_info:
                    county_info[c].append(z)
                    break
Since the function modifies dict1 in-place, running merge_lines(dict1, dict2) gives us a modified dict1 that looks like what you expect:
{'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 10000}]}
This is a more generic approach, where d1 is updated with changes contained in d2.
Of course, you should provide more examples of updates from d2 to see if my solution works in all cases.
#!/usr/bin/env python3
d1 = {"CA": [{"Marin": [{"zip": 1}], "population": 10000}]}
d2 = {"CA": {"Marin": {"zip": 2}}}

# read update data from d2 and append new values to d1
def update(d1, d2):
    for key, value in d2.items():
        if key in d1:
            for key1, value1 in value.items():
                if key1 in d1[key][0]:
                    d1[key][0][key1].append(value1)
                else:
                    d1[key][0][key1] = [value1]
        else:
            d1[key] = [value]
    return d1

print(update(d1, d2))
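Both answers above hard-code two levels of nesting, while the question notes that field names and depth can vary. Here is a hedged sketch of a recursive variant that treats the first dictionary as the structural template. The function name merge_into and the matching rule (a list element is chosen when it shares a key with the update and holds a dict or list under that key) are assumptions made here, and they have only been checked against the single example in the question.

```python
def merge_into(target, update):
    """Merge update into target in place, following target's structure."""
    for key, new_value in update.items():
        if key not in target:
            target[key] = new_value
            continue
        current = target[key]
        if isinstance(current, dict) and isinstance(new_value, dict):
            merge_into(current, new_value)
        elif isinstance(current, list) and isinstance(new_value, dict):
            # Descend into the first element that has a container under a shared key.
            for element in current:
                if isinstance(element, dict) and any(
                    k in element and isinstance(element[k], (dict, list))
                    for k in new_value
                ):
                    merge_into(element, new_value)
                    break
            else:
                # No element to descend into: the update is a new record.
                current.append(new_value)
        elif isinstance(current, list):
            current.append(new_value)
        else:
            # Scalars are simply overwritten (an assumption, not from the question).
            target[key] = new_value
    return target


d1 = {"CA": [{"Marin": [{"zip": 1}], "population": 10000}]}
d2 = {"CA": {"Marin": {"zip": 2}}}
merged = merge_into(d1, d2)
print(merged)
# {'CA': [{'Marin': [{'zip': 1}, {'zip': 2}], 'population': 10000}]}
```

Other update shapes (for example a new county, or a changed population) may need a different matching rule, so it is worth testing with more lines from the real TSV file.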

Filter one dictionary out of list of dictionaries and extract values of certain keys

Objective:
Filter one dictionary out of a list of dictionaries and extract specific values from it, using Python 3.x.
Code:
output = data.decode("utf-8")
input = json.loads(output)
for itm in input['items']:
    print(itm)
The above code output:
{'name': 'ufg', 'id': '0126ffc8-a1b1-423e-b7fe-56d4e93a80d6', 'created_at': '2022-06-16T04:37:32.958Z'}
{'name': 'xyz', 'id': '194ac74b-54ac-45c6-b4d3-c3ae3ebc1d27', 'created_at': '2022-06-26T10:32:50.307Z'}
{'name': 'defg', 'id': '3744bdaa-4e74-46f6-bccb-1dc2eca2d2c1', 'created_at': '2022-06-26T10:55:21.273Z'}
{'name': 'abcd', 'id': '41541893-f916-426b-b135-c7500759b0b3', 'created_at': '2022-06-24T08:39:39.806Z'}
Now I need to filter this output. For example, I want only the dictionary whose 'name' is 'abcd'.
Expected filtered output:
{'name': 'abcd', 'id': '41541893-f916-426b-b135-c7500759b0b3', 'created_at': '2022-06-24T08:39:39.806Z'}
Next, I need to extract only 'name' and 'id' for this 'abcd' entry into Python variables to use in the next part of the program.
Please suggest.
For your situation, converting each item into a dictionary and then filtering is probably best.
for itm in input['items']:
    itm = dict(itm)
    if itm["name"] == "abcd":
        print(itm["name"], itm["id"])
However, if itm = dict(itm) does not work in your situation, you can call json.loads(itm) first and then itm = dict(itm).
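If only the first match is needed and the values should end up in variables rather than being printed, a generator with next() is one option. This is a minimal sketch; the list items below is a shortened copy of the output shown in the question, and the variable names match, name and item_id are chosen here for illustration.

```python
# Shortened copy of the parsed 'items' list from the question.
items = [
    {'name': 'ufg', 'id': '0126ffc8-a1b1-423e-b7fe-56d4e93a80d6', 'created_at': '2022-06-16T04:37:32.958Z'},
    {'name': 'abcd', 'id': '41541893-f916-426b-b135-c7500759b0b3', 'created_at': '2022-06-24T08:39:39.806Z'},
]

# next() returns the first matching dictionary, or None when nothing matches.
match = next((itm for itm in items if itm.get('name') == 'abcd'), None)

if match is not None:
    name, item_id = match['name'], match['id']
    print(name, item_id)
```

Checking for None before unpacking avoids a TypeError when no entry has the requested name.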

How to read dynamic/changing keys from JSON in Python

I am using the UK Bus API to collect bus arrival times etc.
In Python 3 I have been using....
import http.client
import json
from types import SimpleNamespace as Namespace

connection = http.client.HTTPSConnection("transportapi.com")
try:
    connection.request("GET", "/v3/uk/bus/stop/xxxxxxxxx/live.json?app_id=xxxxxxxxxxxxxxx&app_key=xxxxxxxxxxxxxxxxxxxxxxxxxxx&group=route&nextbuses=yes")
    res = connection.getresponse()
    data = res.read()
finally:
    # Close the connection whether or not the request succeeded.
    connection.close()
x = json.loads(data, object_hook=lambda d: Namespace(**d))
print("Stop Name : " + x.stop_name)
Which is all reasonably simple, however the JSON data returned looks like this...
{
    "atcocode":"xxxxxxxx",
    "smscode":"xxxxxxxx",
    "request_time":"2020-03-10T15:42:22+00:00",
    "name":"Hospital",
    "stop_name":"Hospital",
    "bearing":"SE",
    "indicator":"adj",
    "locality":"Here",
    "location":{
        "type":"Point",
        "coordinates":[
            -1.xxxxx,
            50.xxxxx
        ]
    },
    "departures":{
        "8":[
            {
                "mode":"bus",
                "line":"8",
                "line_name":"8",
                "direction":"North",
                "operator":"CBLE",
                "date":"2020-03-10",
Under "departures" the key name changes due to the bus number / code.
Using Python 3 how do I extract the key name and all subsequent values below/within it?
Many thanks for any help!
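You can do this, assuming x is a plain dictionary (that is, parsed with json.loads(data) without the object_hook):
for k, v in x["departures"].items():
    print(k, v)  # or whatever else you want to do
Which prints:
8 [{'mode': 'bus', 'line': '8', 'line_name': '8', 'direction': 'North', 'operator': 'CBLE', 'date': '2020-03-10'}]
So k is the key "8" and v is the list of departures stored under it. With the SimpleNamespace object_hook from the question, x is not subscriptable; iterate over vars(x.departures).items() instead.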
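To make the difference between the two parsing styles concrete, here is a small sketch that reads the changing keys both from plain dictionaries and from SimpleNamespace objects. The string raw is a trimmed-down sample modelled on the response in the question, not real API output, and the variable names are chosen for illustration.

```python
import json
from types import SimpleNamespace

# Trimmed-down sample modelled on the response in the question.
raw = '{"stop_name": "Hospital", "departures": {"8": [{"mode": "bus", "line": "8", "direction": "North"}]}}'

# Plain dictionaries: iterate over the changing keys with .items().
as_dict = json.loads(raw)
dict_routes = {
    route: [dep["direction"] for dep in deps]
    for route, deps in as_dict["departures"].items()
}

# SimpleNamespace objects (the question's object_hook): vars() exposes
# the attributes as a regular dict, including keys such as "8".
as_ns = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))
ns_routes = {
    route: [dep.direction for dep in deps]
    for route, deps in vars(as_ns.departures).items()
}

print(dict_routes)  # {'8': ['North']}
print(ns_routes)    # {'8': ['North']}
```

Plain dictionaries are usually the simpler choice when keys are not valid Python identifiers, since an attribute named "8" cannot be reached with dot notation.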

Recursively Slicing JSON into Dataframe Columns

I have a dataframe with a column containing JSON, where one record looks like this:
player_feedback
{'player': '1b87a117-09ef-41e2-8710-6bc144760a74',
'feedback': [{'answer': [{'id': '1-6gaincareerinfo', 'content': 'To gain career information'},
{'id': '1-5proveskills', 'content': 'Opportunity to prove skills by competing '},
{'id': '1-1diff', 'content': 'Try something different'}], 'question': 1},
{'answer': [{'id': '2-2skilldev', 'content': 'Skill development'}], 'question': 2},
{'answer': [{'id': '3-6exploit', 'content': 'Exploitation'},
{'id': '3-1forensics', 'content': 'Forensics'}], 'question': 3},
{'answer': 'verygood', 'question': 4},
{'answer': 'poor', 'question': 5}, ... ... ,
{'answer': 'verygood', 'question': 15}]}
Here are the first 5 rows of the data.
I want to convert this column to separate columns like -
player | Question 1 | Question 2 | ... | Question 15
1b87a117-09ef-41e2-8710-6bc144760a74 | To gain career information, Opportunity to prove skills by competing, Try something different | Skill development | ... | verygood
I started with -
df_survey_responses['player_feedback'].apply(ast.literal_eval).values.tolist()
but that only gets me the player id in a separate field and the feedback in another. As far as I can tell, json_normalize would give me a similar result. How can I do this recursively to get my desired result, or is there a better way to do this?
Thanks!
You can use a JSON flattener like this one:
def flatten_json(nested_json):
    """
    Flatten json object with nested keys into a single level.
    Args:
        nested_json: A nested json object.
    Returns:
        A flat dict whose keys join the nested path with underscores.
    """
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            # Leaf value: drop the trailing underscore from the path.
            out[name[:-1]] = x

    flatten(nested_json)
    return out
Which gives dataframes that look like this:
0
player 34a8eb8a-056f-4568-88dc-8736056819a3
feedback_0_answer_0_id 1-5proveskills
feedback_0_answer_0_content Opportunity to prove skills by competing
feedback_0_question 1
feedback_1_answer_0_id 2-1networking
feedback_1_answer_0_content Networking
feedback_1_answer_1_id 2-2skilldev
feedback_1_answer_1_content Skill development
feedback_1_question 2
feedback_2_answer_0_id 3-5boottoroot
feedback_2_answer_0_content Boot2root
feedback_2_answer_1_id 3-6exploit
feedback_2_answer_1_content Exploitation
feedback_2_question 3
feedback_3_answer good
feedback_3_question 4
feedback_4_answer good
feedback_4_question 5
feedback_5_answer selfchose
feedback_5_question 6
feedback_6_answer pairs
feedback_6_question 7
feedback_7_answer_0_id 7-persistence
feedback_7_answer_0_content Persistence
feedback_7_question 8
feedback_8_answer social
feedback_8_question 9
feedback_9_answer training
feedback_9_question 10
feedback_10_answer yes
feedback_10_question 11
feedback_11_answer yes
feedback_11_question 12
feedback_12_answer yes
feedback_12_question 13
feedback_13_answer yes
feedback_13_question 14
feedback_14_answer verygood
feedback_14_question 15
feedback_15_answer yes
feedback_15_question 16
feedback_16_answer yes
feedback_16_question 17
feedback_17_answer It would be good to have more exploitation one...
feedback_17_question 18
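The flattened keys above still need reshaping to reach the "Question N" layout asked for in the question. As an alternative, here is a hedged sketch that builds one row per record directly from the feedback list. The function name feedback_to_row is an assumption made here, and the sample record is a shortened copy of the one in the question.

```python
def feedback_to_row(record):
    """Turn one player_feedback record into a flat {column: value} row."""
    row = {"player": record["player"]}
    for item in record["feedback"]:
        answer = item["answer"]
        if isinstance(answer, list):
            # Multiple-choice answers: join the 'content' texts into one string.
            answer = ", ".join(part["content"].strip() for part in answer)
        row[f"Question {item['question']}"] = answer
    return row


# Shortened copy of the record shown in the question.
sample = {
    'player': '1b87a117-09ef-41e2-8710-6bc144760a74',
    'feedback': [
        {'answer': [{'id': '1-6gaincareerinfo', 'content': 'To gain career information'},
                    {'id': '1-5proveskills', 'content': 'Opportunity to prove skills by competing '},
                    {'id': '1-1diff', 'content': 'Try something different'}], 'question': 1},
        {'answer': [{'id': '2-2skilldev', 'content': 'Skill development'}], 'question': 2},
        {'answer': 'verygood', 'question': 4},
    ],
}

row = feedback_to_row(sample)
print(row)
```

Assuming the column has already been parsed with ast.literal_eval as in the question, a list of such rows can be passed to pandas.DataFrame to get one column per question.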

Getting values from Json data in Python

I have a JSON file that I am trying to pull specific attribute data from. The JSON data is essentially a dictionary. Before the data is written to a file, it is first held in a variable like this:
import requests

params = {'f': 'json', 'where': '1=1', 'geometryType': 'esriGeometryPolygon', 'spatialRel': 'esriSpatialRelIntersects','outFields': '*', 'returnGeometry': 'true'}
r = requests.get('https://hazards.fema.gov/gis/nfhl/rest/services/CSLF/Prelim_CSLF/MapServer/3/query', params)
cslfJson = r.json()
and then written into a file like this:
import json

path = r"C:/Workspace/Sandbox/ScratchTests/cslf.json"
with open(path, 'w') as f:
    json.dump(cslfJson, f, indent=2)
Within this JSON data is an attribute called DFIRM_ID. I want to create an empty list, dfirm_id = [], and append every DFIRM_ID value to it with dfirm_id.append(value). I think I need to loop through either the JSON variable or the file itself, but I am not sure how to do it. Any suggestions for an easy way to accomplish this? My attempt so far:
dfirm_id = []
for k, v in cslf:
    if cslf[k] == 'DFIRM_ID':
        dfirm.append(cslf[v])
As requested, here is what print(cslfJson) looks like:
It actually prints a huge dictionary that looks like this:
{'displayFieldName': 'CSLF_ID', 'fieldAliases': {'OBJECTID':
'OBJECTID', 'CSLF_ID': 'CSLF_ID', 'Area_SF': 'Area_SF', 'Pre_Zone':
'Pre_Zone', 'Pre_ZoneST': 'Pre_ZoneST', 'PRE_SRCCIT': 'PRE_SRCCIT',
'NEW_ZONE': 'NEW_ZONE', 'NEW_ZONEST': .... {'attributes': {'OBJECTID':
26, 'CSLF_ID': '13245C_26', 'Area_SF': 5.855231804165408e-05,
'Pre_Zone': 'X', 'Pre_ZoneST': '0.2 PCT ANNUAL CHANCE FLOOD HAZARD',
'PRE_SRCCIT': '13245C_STUDY1', 'NEW_ZONE': 'A', 'NEW_ZONEST': None,
'NEW_SRCCIT': '13245C_STUDY2', 'CHHACHG': 'None (Zero)', 'SFHACHG':
'Increase', 'FLDWYCHG': 'None (Zero)', 'NONSFHACHG': 'Decrease',
'STRUCTURES': None, 'POPULATION': None, 'HUC8_CODE': None, 'CASE_NO':
None, 'VERSION_ID': '2.3.3.3', 'SOURCE_CIT': '13245C_STUDY2', 'CID':
'13245C', 'Pre_BFE': -9999, 'Pre_BFE_LEN_UNIT': None, 'New_BFE':
-9999, 'New_BFE_LEN_UNIT': None, 'BFECHG': 'False', 'ZONECHG': 'True', 'ZONESTCHG': 'True', 'DFIRM_ID': '13245C', 'SHAPE_Length':
0.009178426056888393, 'SHAPE_Area': 4.711699932249018e-07, 'UID': 'f0125a91-2331-4318-9a50-d77d042a48c3'}}, {'attributes': .....}
If your JSON data is already a dictionary, take advantage of that. The beauty of a dictionary (hash map) is that key lookups take O(1) time on average.
Based on your comment, I believe this will solve your problem:
dfirm_id = []
for feature in cslfJson['features']:
    dfirm_id.append(feature['attributes']['DFIRM_ID'])
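The same loop can be written as a list comprehension that also skips features without a DFIRM_ID. This is a minimal sketch: the dictionary cslf_json below is a tiny stand-in for the real response, its second feature is hypothetical and exists only to show the skip, and the 'features' key is taken from the answer above rather than from the printed output.

```python
# Tiny stand-in for the response; the second feature is hypothetical.
cslf_json = {
    'displayFieldName': 'CSLF_ID',
    'features': [
        {'attributes': {'OBJECTID': 26, 'CSLF_ID': '13245C_26', 'DFIRM_ID': '13245C'}},
        {'attributes': {'OBJECTID': 27}},
    ],
}

# Collect DFIRM_ID from every feature that actually has one.
dfirm_id = [
    feature['attributes']['DFIRM_ID']
    for feature in cslf_json.get('features', [])
    if 'DFIRM_ID' in feature.get('attributes', {})
]

print(dfirm_id)  # ['13245C']
```

To work from the saved file instead of the variable, load it first with json.load inside a with open(path) block and run the same comprehension on the result.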