Extract data from a JSON file using python - json

Say I have a JSON entry as follows (the JSON file was generated by fetching data from a Firebase DB):
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
How do I extract the content corresponding to 'value', which sits inside the 'bills' field? Is there a way to do this?
My Python code is as follows. With it I was only able to get the data within 'bills', but I need only the entry corresponding to 'value' inside each bill.
import json

# Load the exported Firebase data (a list of user records)
with open('firebase-dataset.json', 'r') as filedata:
    data = json.load(filedata)

listoffields = []  # one entry per record: that record's list of bills
for dic in data:
    try:
        listoffields.append(dic['bills'])
    except KeyError:
        pass  # this record has no 'bills' key
print(listoffields)

The JSON you posted contains misplaced quotes, so it will not parse as posted.
I think you are trying to extract the 'value' field of the entries within 'bills'.
Try this:
print(listoffields[0][0]['value'])
This prints 100.0. If the value is stored as a string in your file, wrap it in float() before using it in calculations.
---edit---
Say your JSON contains many objects separated by commas, like this:
[{ first-entry },{ second-entry },{ third.. }, ....and so on]
and you want the value of the bill in each JSON object. The code below may work:
bill_value_list = []  # to store the 'value' of each bill
for bill_list in listoffields:
    if bill_list:  # skip records whose 'bills' list is empty
        # bill_list[0] is the complete dictionary of the first bill
        bill_value_list.append(float(bill_list[0]['value']))
print(bill_value_list)
print(sum(bill_value_list))  # do something useful
Paste it after the code you posted; no changes to your code are needed.
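The snippet above only reads the first bill of each record. As a minimal sketch of collecting every bill's value, assuming the file parses cleanly: the helper name `bill_values` and the second bill ("Luz") in the sample are hypothetical, added only to show a record with more than one bill and a value stored as a string.

```python
import json

# Hypothetical sample: a trimmed, cleaned-up stand-in for the records in the question.
SAMPLE = '''
[
  {"name": "a", "bills": [{"name": "Supermercado", "value": 100.0},
                          {"name": "Luz", "value": "45.5"}]},
  {"name": "b", "bills": []},
  {"name": "c"}
]
'''

def bill_values(records):
    """Return every bill's 'value' as a float, across all records."""
    values = []
    for record in records:
        # .get() with a default covers records that lack a 'bills' key
        for bill in record.get("bills", []):
            if "value" in bill:
                values.append(float(bill["value"]))
    return values

records = json.loads(SAMPLE)
values = bill_values(records)
print(values)
print(sum(values))
```

Records with an empty 'bills' list or no 'bills' key simply contribute nothing, so no try/except is needed.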

Related

pandas json_normalize nested json where dictionary only exists on some records

I am trying to run pandas.json_normalize on a data file that has highly varied, nested json, where the content of the records can vary considerably.
I am processing a house listing file and trying to pull out prices. The prices data is stored as follows, and 'prices' is at the first nesting level within the json file:
"prices": [
{
"amountMax": 420000,
"amountMin": 420000,
"availability": "false",
"currency": "USD",
"dateSeen": [
"2020-12-21T11:57:17.190Z",
"2020-12-25T02:35:41.009Z"
],
"isSale": "false",
"isSold": "true",
"pricePerSquareFoot": 235,
"sourceURLs": [
"https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
]
}, # followed by additional entries
I am using the following line of code, which works if I edit the input file down to a single record that includes a 'prices' section:
df3 = pd.json_normalize(df['records'], record_path='prices',
                        meta=['id'],
                        errors='ignore')
However, the full file includes many records that do not include a prices section. If I run the code against a file with 2 records (one with, one without), it fails with KeyError: 'prices'.
Clearly errors='ignore' in json_normalize is not enough to handle this error.
What can I do? I would just like to skip the records without prices entirely.
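A list comprehension on your JSON will do it: filter out the records that have no 'prices' key before normalizing. (errors='ignore' only covers keys listed in meta, not record_path.) I've synthesized some JSON to match your description of the input data.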
js = {
"records": [
{
"prices": [
{
"amountMax": 420000,
"amountMin": 420000,
"availability": "false",
"currency": "USD",
"dateSeen": [
"2020-12-21T11:57:17.190Z",
"2020-12-25T02:35:41.009Z"
],
"isSale": "false",
"isSold": "true",
"pricePerSquareFoot": 235,
"sourceURLs": [
"https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
]
}
],
"id": 1
},{"id":2}
]
}
import pandas as pd

# keep only the records that actually have a 'prices' list
records_with_prices = [r for r in js["records"] if "prices" in r]
pd.json_normalize(records_with_prices, record_path="prices", meta="id")
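If the goal is only to pull the price entries out, the same skip-records-without-prices idea can be sketched without pandas. This is an illustration under assumptions, not a replacement for json_normalize: the helper name `price_rows` and the trimmed sample records are hypothetical.

```python
def price_rows(records):
    """Yield one flat dict per price entry, tagged with the parent record's id."""
    for record in records:
        # Records without a 'prices' key contribute no rows at all
        for price in record.get("prices", []):
            row = {"id": record.get("id")}
            row.update(price)
            yield row

# Hypothetical sample: one record with prices, one without
sample_records = [
    {"id": 1, "prices": [{"amountMax": 420000, "amountMin": 420000, "currency": "USD"}]},
    {"id": 2},
]

rows = list(price_rows(sample_records))
print(rows)
```

The resulting list of dicts can be passed to a DataFrame constructor later if a table is still wanted.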

Compare two nested json files and show user where exactly the change has occurred and which json file using Python?

I have two JSON files and I am validating whether the responses are the same or different. I need to show the user exactly where a change occurred, for example that a particular key was added, removed, or changed in a given file.
file1.json
[
{
"Name": "Jack",
"region": "USA",
"tags": [
{
"name": "Name",
"value": "Assistant"
}
]
},
{
"Name": "MATHEW",
"region": "USA",
"tags": [
{
"name": "Name",
"value": "Worker"
}
]
}
]
file2.json
[
{
"Name": "Jack",
"region": "USA",
"tags": [
{
"name": "Name",
"value": "Manager"
}
]
},
{
"Name": "MATHEW",
"region": "US",
"tags": [
{
"name": "Name",
"value": "Assistant"
}
]
}
]
Comparing the two files, you can see that in file2.json the region has changed from USA to US and the tag values have changed (Assistant to Manager for Jack, Worker to Assistant for MATHEW). I want to show the user exactly which changes file2.json contains, such as region: US and the changed tag values.
I have used deepdiff for validation.
from deepdiff import DeepDiff

def difference(oldurl_resp, newurl_resp, file1):
    ddiff = DeepDiff(oldurl_resp, newurl_resp, ignore_order=True)
    if ddiff == {}:
        print("BOTH JSON FILES MATCH !!!")
        return True
    else:
        print("FAILURE")
        output = ddiff
        # JsonCompareError is a custom exception that is not shown in this snippet
        if 'iterable_item_added' in output:
            test = output.get('iterable_item_added')
            print('The Resource name are->')
            i = []
            for k in test:
                print("Name: ", test[k]['Name'])
                print("Region: ", test[k]['region'])
                msg = (" Name ->" + test[k]['Name'] + " Region:" + test[k]['region'] + ". ")
                i.append(msg)
            raise JsonCompareError("The json file has KEYS changed!. Please validate for below" + str(i) + "in " + file1)
        elif 'iterable_item_removed' in output:
            test2 = output.get('iterable_item_removed')
            print('The name are->')
            i = []
            for k in test2:
                print(test2[k]['Name'])
                print(test2[k]['region'])
                msg = (" Resource Name ->" + test2[k]['Name'] + " Region:" + test2[k]['region'] + ". ")
                i.append(msg)
            raise JsonCompareError("The json file has Keys Removed!!. Please validate for below" + str(i) + "in " + file1)
This code only shows the resource Name. I also want to show the tags that were changed, added, or removed.
Can anybody guide me?
If you print the value of the "test" variable, you will find that the "tags" changes are already inside it. In this example, test will be:
test = {'root[0]': {'region': 'USA', 'Name': 'Jack', 'tags': [{'name': 'Name', 'value': 'Manager'}]}, 'root[1]': {'region': 'US', 'Name': 'MATHEW', 'tags': [{'name': 'Name', 'value': 'Assistant'}]}}
and you can print test[k]['tags'] or add it to your "msg" variable.
Suggestion:
Also, if your data has a primary key (for example an "id", or the order of the records is always fixed), you can compare the records one by one instead of comparing whole lists, which gives a more precise comparison. For example, comparing the two "Jack" records with each other gives:
{'iterable_item_removed': {"root['tags'][0]": {'name': 'Name', 'value': 'Assistant'}}, 'iterable_item_added': {"root['tags'][0]": {'name': 'Name', 'value': 'Manager'}}}
You should try the deepdiff library. It gives you the key where the difference occurs and the old and new value.
from deepdiff import DeepDiff
ddiff = DeepDiff(json_object1, json_object2)
# if you want to compare by ignoring order
ddiff = DeepDiff(json_object1, json_object2, ignore_order=True)
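To illustrate the idea of reporting exactly where a change occurred, here is a minimal stdlib-only sketch. It is not how deepdiff works internally: `diff_paths` is a hypothetical helper, the `<added>` and `<removed>` markers are made up for this example, and lists are compared position by position, which assumes the order of the records is fixed.

```python
def diff_paths(old, new, path="root"):
    """Return a list of (path, old_value, new_value) for every difference."""
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(set(old) | set(new)):
            child = "%s[%r]" % (path, key)
            if key not in new:
                changes.append((child, old[key], "<removed>"))
            elif key not in old:
                changes.append((child, "<added>", new[key]))
            else:
                changes.extend(diff_paths(old[key], new[key], child))
    elif isinstance(old, list) and isinstance(new, list):
        for index in range(max(len(old), len(new))):
            child = "%s[%d]" % (path, index)
            if index >= len(new):
                changes.append((child, old[index], "<removed>"))
            elif index >= len(old):
                changes.append((child, "<added>", new[index]))
            else:
                changes.extend(diff_paths(old[index], new[index], child))
    elif old != new:
        # Leaf values (or mismatched types) that differ
        changes.append((path, old, new))
    return changes

# The two files from the question, as Python objects
file1 = [
    {"Name": "Jack", "region": "USA", "tags": [{"name": "Name", "value": "Assistant"}]},
    {"Name": "MATHEW", "region": "USA", "tags": [{"name": "Name", "value": "Worker"}]},
]
file2 = [
    {"Name": "Jack", "region": "USA", "tags": [{"name": "Name", "value": "Manager"}]},
    {"Name": "MATHEW", "region": "US", "tags": [{"name": "Name", "value": "Assistant"}]},
]

changes = diff_paths(file1, file2)
for where, before, after in changes:
    print("%s: %r -> %r" % (where, before, after))
```

Each reported path names the record, key, and list position of the change, which is the information the question asks to show the user.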

reshape jq nested file and make csv

I've been struggling all day with this JSON, which I want to turn into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the JSON to contain just 2 objects inside "items".
What I would like to get is a CSV like this:
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are two extra complications. First, there are two types of "officers": some are people and some are companies, so not all keys present for people are present for companies and vice versa; I'd like the missing entries to be 'null'. Second, some fields are awkward: "name" contains a comma, and "address" contains several sub-fields (which I guess I could flatten in pandas, though).
{
"total_results": 13,
"resigned_count": 9,
"links": {
"self": "/company/OC418979/officers"
},
"items_per_page": 35,
"etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
"kind": "officer-list",
"active_count": 4,
"inactive_count": 0,
"start_index": 0,
"items": [
{
"officer_role": "llp-designated-member",
"name": "BARRICK, David James",
"date_of_birth": {
"year": 1984,
"month": 1
},
"appointed_on": "2017-09-15",
"country_of_residence": "England",
"address": {
"country": "United Kingdom",
"address_line_1": "Old Gloucester Street",
"locality": "London",
"premises": "27",
"postal_code": "WC1N 3AX"
},
"links": {
"officer": {
"appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
}
}
},
{
"links": {
"officer": {
"appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
}
},
"address": {
"locality": "Tadcaster",
"country": "United Kingdom",
"address_line_1": "Westgate",
"postal_code": "LS24 9AB",
"premises": "5a"
},
"identification": {
"legal_authority": "UK",
"identification_type": "non-eea",
"legal_form": "UK"
},
"name": "PREMIER DRIVER LIMITED",
"officer_role": "corporate-llp-designated-member",
"appointed_on": "2017-09-15"
}
]
}
What I've been doing is creating new json objects extracting the fields I needed like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:.items[]?.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours - and I am sure there is a quicker way.
Right now I am trying a new approach: creating a JSON object whose key is the company number and whose value is its officers.
{(.links.self | split("/")[2]): .items[]}
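Using jq, it's easier to extract the values shared by all rows from the top-level object first and then generate the desired rows. You'll want to go through the items at most once: every separate .items[] in an object constructor is its own generator, so jq emits the Cartesian product of all of them, which is likely why your query runs for hours.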
$ jq -r '(.links.self | split("/")[2]) as $companyCode
    | .items[]
    | [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
    | @csv
  ' input.json
OK, you want to scan the list of officers, extract some fields if they are present, and write them in CSV format.
The first part is to extract the data from the JSON. Assuming you loaded it into a Python object named data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')
                         ))
The get method on a dictionary lets you supply a default value (here the empty string) if the key is not present, and the csv module ensures that a field containing a comma is enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
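The approach above can be sketched end to end without hard-coding the company number. This is a sketch under stated assumptions: `officers_to_csv` and `FIELDS` are hypothetical names, the input is a trimmed copy of the document in the question, and the "name" column is added only to show how the csv module quotes a field that contains a comma.

```python
import csv
import io
import json

# Trimmed copy of the officers document from the question
RAW = '''
{
  "links": {"self": "/company/OC418979/officers"},
  "items": [
    {"officer_role": "llp-designated-member", "name": "BARRICK, David James",
     "appointed_on": "2017-09-15", "country_of_residence": "England"},
    {"officer_role": "corporate-llp-designated-member",
     "name": "PREMIER DRIVER LIMITED", "appointed_on": "2017-09-15"}
  ]
}
'''

FIELDS = ("name", "country_of_residence", "officer_role", "appointed_on")

def officers_to_csv(doc, out):
    """Write one CSV row per officer, prefixed with the company number."""
    # "/company/OC418979/officers".split("/") -> ['', 'company', 'OC418979', 'officers']
    company_number = doc["links"]["self"].split("/")[2]
    writer = csv.writer(out, lineterminator="\n")
    for officer in doc["items"]:
        # Missing keys become empty fields
        writer.writerow([company_number] + [officer.get(f, "") for f in FIELDS])

buffer = io.StringIO()
officers_to_csv(json.loads(RAW), buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file object instead produces the CSV file.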

Python 2.7: Generate JSON file with multiple query results in nested dict

What started as a personal initiative ended up being quite an interesting (and, may I say, somewhat challenging) project. My company decided to phase out one product and replace it with a new one which, instead of storing data in mdb files, uses JSON files. So I took the initiative to create a converter that reads the existing mdb files and converts them into the new JSON format.
However, now I'm at my wits' end with this one:
I can read the mdb files and run a query to extract specific data.
By placing targetobj inside the for loop, I managed to extract the data for each row and feed it into a dict (targetobj):
for val in rows:
    targetobj = {"connection_props": {"port": 7800, "service": "", "host": val.Hostname, "pwd": "", "username": ""},
                 "group_list": val.Groups, "cpu_core_cnt": 2, "target_name": "somename", "target_type": "somethingsamething",
                 "os": val.OS, "rule_list": [], "user_list": val.Users}
If I print targetobj to the console, I can clearly see all the extracted values for each row.
Now, my quest is to have the results obtained for each row inserted into main_dict under the key "targets" (please see the sample JSON file below for illustration):
main_dict = {"changed_time": 0, "year": 0, "description": 'blahblahblah', 'targets':[RESULTS FROM TARGETOBJ SHOULD BE ADDED HERE],"enabled": False}
So, for example, my JSON file should have a structure such as:
{"changed_time":1234556,
 "year":0,
 "description":"blahblahblah",
 "targets":[
   {"group_list":["QA"],
    "cpu_core_cnt":1,
    "target_name":"NewTarget",
    "os":"unix",
    "target_type":"",
    "rule_list":[],
    "user_list":[""],
    "connection_props":{"port":someport,"service":"","host":"host1","pwd":"","username":""}
   },
   {"group_list":[],
    "cpu_core_cnt":2,
    "target_name":"",
    "os":"unix",
    "target_type":"",
    "rule_list":[],
    "user_list":["Web2user"],
    "connection_props":{"port":anotherport,"service":"","host":"host2","pwd":"","username":""}}
 ],
 "enabled":false}
So far I've been tweaking here and there to get the results written as intended; however, each time I only get the last row's values written,
e.g. when putting targetobj as a variable inside targets:[]:
{"changed_time": 0, "year": 0, "description": 'ConvertedConfigFile', 'targets':[targetobj],
I know I'm missing something, I just need to find what and where.
Any help would be highly appreciated.
thank you
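Reassigning targetobj on every iteration overwrites the previous row, so only the last one survives. Just create your main_dict first and append to its "targets" list inside your loop, i.e.: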
main_dict = {"changed_time": 0,
             "year": 0,
             "description": "blahblahblah",
             "targets": [],  # a new list for the target objects
             "enabled": False}
for val in rows:
    main_dict["targets"].append({  # append this dict to the targets list of main_dict
        "connection_props": {
            "port": 7800,
            "service": "",
            "host": val.Hostname,
            "pwd": "",
            "username": ""},
        "group_list": val.Groups,
        "cpu_core_cnt": 2,
        "target_name": "somename",
        "target_type": "somethingsamething",
        "os": val.OS,
        "rule_list": [],
        "user_list": val.Users
    })
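To check the result and produce the JSON text, the loop can be exercised with stand-in rows. This is a minimal sketch: the `Row` namedtuple and its sample values are hypothetical substitutes for whatever the mdb query returns, and the target dict is shortened to the fields that depend on the row.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for the rows returned by the mdb query
Row = namedtuple("Row", ["Hostname", "Groups", "OS", "Users"])
rows = [
    Row("host1", ["QA"], "unix", [""]),
    Row("host2", [], "unix", ["Web2user"]),
]

main_dict = {"changed_time": 0, "year": 0, "description": "blahblahblah",
             "targets": [], "enabled": False}

for val in rows:
    # A new dict is built and appended on every iteration, so no row is lost
    main_dict["targets"].append({
        "connection_props": {"port": 7800, "service": "", "host": val.Hostname,
                             "pwd": "", "username": ""},
        "group_list": val.Groups,
        "os": val.OS,
        "user_list": val.Users,
    })

# json.dumps converts Python's False to JSON's false
print(json.dumps(main_dict, indent=2))
```

Use json.dump with an open file object instead of json.dumps to write the result straight to the new JSON file.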

Access deeper elements of a JSON using postgresql 9.4

I want to be able to access deeper elements of the JSON stored in a json column (data) in a PostgreSQL database. For example, I would like to access the elements along the path states->events->time in the JSON provided below. Here is the PostgreSQL query I'm using:
SELECT
data#>> '{userId}' as user,
data#>> '{region}' as region,
data#>>'{priorTimeSpentInApp}' as priotTimeSpentInApp,
data#>>'{userAttributes, "Total Friends"}' as totalFriends
from game_json
WHERE game_name LIKE 'myNewGame'
LIMIT 1000
and here is an example record from the json field
{
"region": "oh",
"deviceModel": "inHouseDevice",
"states": [
{
"events": [
{
"time": 1430247045.176,
"name": "Session Start",
"value": 0,
"parameters": {
"Balance": "40"
},
"info": ""
},
{
"time": 1430247293.501,
"name": "Mission1",
"value": 1,
"parameters": {
"Result": "Win ",
"Replay": "no",
"Attempt Number": "1"
},
"info": ""
}
]
}
],
"priorTimeSpentInApp": 28989.41467999999,
"country": "CA",
"city": "vancouver",
"isDeveloper": true,
"time": 1430247044.414,
"duration": 411.53,
"timezone": "America/Cleveland",
"priorSessions": 47,
"experiments": [],
"systemVersion": "3.8.1",
"appVersion": "14312",
"userId": "ef617d7ad4c6982e2cb7f6902801eb8a",
"isSession": true,
"firstRun": 1429572011.15,
"priorEvents": 69,
"userAttributes": {
"Total Friends": "0",
"Device Type": "Tablet",
"Social Connection": "None",
"Item Slots Owned": "12",
"Total Levels Played": "0",
"Retention Cohort": "Day 0",
"Player Progression": "0",
"Characters Owned": "1"
},
"deviceId": "ef617d7ad4c6982e2cb7f6902801eb8a"
}
That SQL query works, except that it doesn't give me any return values for totalFriends (e.g. data#>>'{userAttributes, "Total Friends"}' as totalFriends). I assume that part of the problem is that events falls within a square bracket (I don't know what that indicates in the json format) as opposed to a curly brace, but I'm also unable to extract values from the userAttributes key.
I would appreciate it if anyone could help me.
I'm sorry if this question has been asked elsewhere. I'm so new to postgresql and even json that I'm having trouble coming up with the proper terminology to find the answers to this (and related) questions.
You should definitely familiarize yourself with the basics of json
and json functions and operators in Postgres.
In the second source pay attention to the operators -> and ->>.
General rule: use -> to get a json object, ->> to get a json value as text.
Using these operators you can rewrite your query so that it returns the correct value of 'Total Friends':
select
data->>'userId' as user,
data->>'region' as region,
data->>'priorTimeSpentInApp' as priotTimeSpentInApp,
data->'userAttributes'->>'Total Friends' as totalFriends
from game_json
where game_name like 'myNewGame';
Json objects in square brackets are elements of a json array.
Json arrays may have many elements.
The elements are accessed by an index.
Json arrays are indexed from 0 (the first element of an array has an index 0).
Example:
select
data->'states'->0->'events'->1->>'name'
from game_json
where game_name like 'myNewGame';
-- returns "Mission1"
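For readers more at home in Python, the path above can be mirrored on a parsed record. This is only an analogy for how -> steps through object keys and array indexes, not a description of how Postgres evaluates the query; `get_path` is a hypothetical helper and the record is trimmed from the example in the question.

```python
import json

# Trimmed copy of the example record from the question
RECORD = json.loads('''
{
  "states": [
    {"events": [
      {"time": 1430247045.176, "name": "Session Start"},
      {"time": 1430247293.501, "name": "Mission1"}
    ]}
  ],
  "userAttributes": {"Total Friends": "0"}
}
''')

def get_path(value, *steps):
    """Follow dict keys and list indexes; return None if a step is missing."""
    for step in steps:
        try:
            value = value[step]
        except (KeyError, IndexError, TypeError):
            return None
    return value

print(get_path(RECORD, "states", 0, "events", 1, "name"))   # Mission1
print(get_path(RECORD, "userAttributes", "Total Friends"))  # 0
print(get_path(RECORD, "states", 0, "events", 5, "name"))   # None
```

String steps play the role of object keys and integer steps the role of array indexes, just as in data->'states'->0->'events'->1->>'name'.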
This did help me