pandas json_normalize nested json where dictionary only exists on some records

I am trying to run pandas.json_normalize on a data file that has highly varied, nested json, where the content of the records can vary considerably.
I am processing a house listing file and trying to pull out prices. The prices data is stored as follows, and 'prices' is at the first nesting level within the json file:
"prices": [
{
"amountMax": 420000,
"amountMin": 420000,
"availability": "false",
"currency": "USD",
"dateSeen": [
"2020-12-21T11:57:17.190Z",
"2020-12-25T02:35:41.009Z"
],
"isSale": "false",
"isSold": "true",
"pricePerSquareFoot": 235,
"sourceURLs": [
"https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
]
}, # followed by additional entries
I am using the following line of code, which works if I edit the input file down to a single record that includes a 'prices' section:
df3 = pd.json_normalize(df['records'], record_path='prices',
                        meta=['id'],
                        errors='ignore')
However, the full file includes many records that do not include a prices section. If I run the code against a file with 2 records (one with, one without), it fails with KeyError: 'prices'
Clearly, errors='ignore' in json_normalize is not enough to handle the error.
What can I do? I would just like to skip the records without prices entirely.

A list comprehension on your JSON will do it. I've synthesized some JSON to match your description of input data.
js = {
    "records": [
        {
            "prices": [
                {
                    "amountMax": 420000,
                    "amountMin": 420000,
                    "availability": "false",
                    "currency": "USD",
                    "dateSeen": [
                        "2020-12-21T11:57:17.190Z",
                        "2020-12-25T02:35:41.009Z"
                    ],
                    "isSale": "false",
                    "isSold": "true",
                    "pricePerSquareFoot": 235,
                    "sourceURLs": [
                        "https://www.redfin.com/FL/Coconut-Creek/.../home/4146834"
                    ]
                }
            ],
            "id": 1
        },
        {"id": 2}
    ]
}
pd.json_normalize([r for r in js["records"] if "prices" in r.keys()],
                  record_path="prices",
                  meta="id")
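Applied to the DataFrame in the question, the same filter could look like this (a sketch, assuming df['records'] holds one parsed dict per listing, as in the original code):
# Sketch: drop the records that have no 'prices' key before normalizing.
# Assumes df['records'] contains one parsed dict per listing.
records_with_prices = [r for r in df['records'] if 'prices' in r]
df3 = pd.json_normalize(records_with_prices, record_path='prices',
                        meta=['id'], errors='ignore')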

Related

JSONPath to get multiple values from nested json

I have a nested JSON in a field that contains multiple important keys that I would like to retrieve as an array:
{
    "tasks": [
        {
            "id": "task_1",
            "name": "task_1_name",
            "assignees": [
                {
                    "id": "assignee_1",
                    "name": "assignee_1_name"
                }
            ]
        },
        {
            "id": "task_2",
            "name": "task_2_name",
            "assignees": [
                {
                    "id": "assignee_2",
                    "name": "assignee_2_name"
                },
                {
                    "id": "assignee_3",
                    "name": "assignee_3_name"
                }
            ]
        }
    ]
}
All the queries that I've tried so far, e.g. $.tasks.*.assignees..id, and many others have returned
[
    "assignee_1",
    "assignee_2",
    "assignee_3"
]
But what I need is:
[
    ["assignee_1"],
    ["assignee_2", "assignee_3"]
]
Is it possible to do with JSONPath or any script inside of it, without involving 3rd party tools?
The problem you're facing is that tasks and assignees are arrays. You need to use [*] instead of .* to get the items in the array. So your path should look like
$.tasks[*].assignees[*].id
You can try it at https://json-everything.net/json-path.
NOTE: The output from my site will give you both the value and its location within the original document.
Edit
(I didn't read the whole thing :) )
You're not going to be able to get
[
    ["assignee_1"],
    ["assignee_2", "assignee_3"]
]
because, as @Tomalak mentioned, JSON Path is a query language. It's going to remove all structure and return only values.
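If stepping outside JSONPath is an option, the grouped shape is straightforward to build in a few lines of Python (a sketch, assuming the document above has already been parsed into a dict named doc):
# Sketch: collect assignee ids per task; `doc` is assumed to be the parsed JSON shown in the question.
grouped = [[a["id"] for a in task.get("assignees", [])] for task in doc["tasks"]]
print(grouped)  # [['assignee_1'], ['assignee_2', 'assignee_3']]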

Converting JSON Table Array Data in Azure Data Factory from Log Analytics REST API JSON Response to Multiple JSON Documents in same file

I'm currently trying to extract data out of Log Analytics through its REST API. I have been successful at using a Copy Data activity to store the response in an Azure Data Lake Gen 2 account.
The format is roughly similar to the example from the Log Analytics API Reference Page.
{
    "tables": [
        {
            "name": "PrimaryResult",
            "columns": [
                {
                    "name": "Category",
                    "type": "string"
                },
                {
                    "name": "count_",
                    "type": "long"
                }
            ],
            "rows": [
                [
                    "Administrative",
                    20839
                ],
                [
                    "Recommendation",
                    122
                ],
                [
                    "Alert",
                    64
                ],
                [
                    "ServiceHealth",
                    11
                ]
            ]
        }
    ]
}
My dataset is much larger, with more columns, more values, etc., but the principles are the same.
What I am trying to do is generate a new JSON file that holds the same table as multiple documents in the same file, e.g.
[
    {
        "Category": "Administrative",
        "count_": 20839
    },
    {
        "Category": "Recommendation",
        "count_": 122
    },
    {
        "Category": "Alert",
        "count_": 64
    },
    {
        "Category": "ServiceHealth",
        "count_": 11
    }
]
The output of this would be stored back into the data lake and then ideally could be used as a source for a copy activity to go into an Azure SQL Database.
I have tried accomplishing this with Data Flows flattening, but haven't been successful so far: when trying to map the column names, it doesn't see the individual column names, just the level of the document where the column names are defined.
How would I go about flattening the dataset so it appears as desired? Is this an unrealistic expectation of Data Flows, or is this task more suitable for something like Azure Databricks?
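For what it's worth, the reshaping being described here is small when expressed in code; a sketch in Python (assuming the response above has been loaded into a dict named payload) shows the intent and may help when weighing Data Flows against something like Databricks:
import json

# Sketch: turn the Log Analytics table shape (columns + rows) into a list of row documents.
# Assumes `payload` is the parsed REST response shown above.
table = payload["tables"][0]
column_names = [c["name"] for c in table["columns"]]
documents = [dict(zip(column_names, row)) for row in table["rows"]]
print(json.dumps(documents, indent=2))  # matches the desired output above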

Extract data from a JSON file using python

Say I have a JSON entry as follows (the JSON file is generated by fetching data from a Firebase DB):
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
How do I extract the content corresponding to 'value', which is present inside the column 'bills'? Is there any way to do this?
My Python code is as follows. With this I was only able to get the data within the bills column, but I need only the entry corresponding to 'value' inside bills.
import json

filedata = open('firebase-dataset.json', 'r')
data = json.load(filedata)

listoffields = []  # To produce it into a list with fields
for dic in data:
    try:
        listoffields.append(dic['bills'])  # only non-essential bill categories.
    except KeyError:
        pass
print(listoffields)
The JSON you posted contains misplaced quotes.
I think you are trying to extract the value of the 'value' column within bills.
Try this:
print(listoffields[0][0]['value'])
which will print 100.0 as a str; use float() to convert it for calculations.
---edit---
Say the JSON you have contains many JSON objects separated by commas, as in
[{ first-entry }, { second-entry }, { third.. }, ... and so on]
and you want to find the value of each bill in each JSON object; maybe the code below will work:
bill_value_list = []  # to store 'value' of each bill
for bill_list in listoffields:
    bill_value_list.append(float(bill_list[0]['value']))  # bill_list[0] will contain the complete bill dictionary.
print(bill_value_list)
print(sum(bill_value_list))  # do something useful
Paste it after the code you posted (no changes to your code, since it always works :-) ).

Copy JSON Array data from REST data factory to Azure Blob as is

I have used REST to get data from an API, and the JSON output contains arrays. When I try to copy the JSON as is to BLOB using a copy activity, I only get the first object's data and the rest is ignored.
The documentation says we can copy JSON as is by skipping the schema section on both the dataset and the copy activity. I followed that and I am getting the output below.
https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#export-json-response-as-is
I tried the copy activity without a schema, using the header as the first row, and output files to BLOB as .json and .txt.
Sample REST output:
{
    "totalPages": 500,
    "firstPage": true,
    "lastPage": false,
    "numberOfElements": 50,
    "number": 0,
    "totalElements": 636,
    "columns": {
        "dimension": {
            "id": "variables/page",
            "type": "string"
        },
        "columnIds": [
            "0"
        ]
    },
    "rows": [
        {
            "itemId": "1234",
            "value": "home",
            "data": [
                65
            ]
        },
        {
            "itemId": "1235",
            "value": "category",
            "data": [
                92
            ]
        },
    ],
    "summaryData": {
        "totals": [
            157
        ],
        "col-max": [
            123
        ],
        "col-min": [
            1
        ]
    }
}
The BLOB output as text is below, which contains only the first object's data:
totalPages,firstPage,lastPage,numberOfElements,number,totalElements
500,True,False,50,0,636
If you want to write the JSON response as is, you can use an HTTP connector. However, please note that the HTTP connector doesn't support pagination.
If you want to keep using the REST connector and write a csv file as output, can you please specify how you want the nested objects and arrays to be written?
In csv files, we cannot write arrays. You could always use a custom activity or an Azure Function activity to call the REST API, parse it the way you want, and write a csv file.
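As an illustration of that last option, a minimal sketch of such a custom/function activity in Python might look like the following (the endpoint URL is a placeholder, and requests is assumed to be available in that environment):
import csv
import requests  # assumed available in the custom activity / Azure Function environment

# Sketch only: the URL and authentication are placeholders, not the real API.
payload = requests.get("https://example.com/api/report").json()

with open("rows.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["itemId", "value", "data"])
    for row in payload["rows"]:
        # csv cells cannot hold arrays, so join the `data` values into one field
        writer.writerow([row["itemId"], row["value"], ";".join(str(d) for d in row["data"])])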
Hope this helps.

Python 2.7: Generate JSON file with multiple query results in nested dict

What started as my personal initiative ended up being a quite interesting (may I say, challenging to some degree) project. My company decided to phase out one product and replace it with a new one, which, instead of storing data in mdb files, uses JSON files. So I took the initiative to create a converter that reads the already created mdb files and converts them into the new JSON format.
However, now I'm at my wit's end with this one:
I can read the mdb files and run a query to extract specific data.
By placing targetobj inside the for loop, I managed to extract the data for each row and feed it into a dict (targetobj):
for val in rows:
    targetobj = {"connection_props": {"port": 7800, "service": "", "host": val.Hostname, "pwd": "", "username": ""},
                 "group_list": val.Groups, "cpu_core_cnt": 2, "target_name": "somename", "target_type": "somethingsamething",
                 "os": val.OS, "rule_list": [], "user_list": val.Users}
If I print targetobj to the console, I can clearly see all extracted values for each row.
Now, my quest is to have the obtained results (for each row) inserted into the main_dict under the key targets:[]. (Please see the sample JSON file below for illustration.)
main_dict = {"changed_time": 0, "year": 0, "description": 'blahblahblah', 'targets': [RESULTS FROM TARGETOBJ SHOULD BE ADDED HERE], "enabled": False}
So, for example, my JSON file should have a structure such as:
{"changed_time":1234556,
"year":0,
"description":"blahblahblah",
"targets":[
{"group_list":["QA"],
"cpu_core_cnt":1,
"target_name":"NewTarget",
"os":"unix",
"target_type":"",
"rule_list":[],
"user_list":[""],"connection_props":"port":someport,"service":"","host":"host1","pwd":"","username":""}
},
{"group_list":[],
"cpu_core_cnt":2,
"target_name":"",
"os":"unix",
"target_type":"",
"rule_list":[],
"user_list":["Web2user"],
"connection_props":{"port":anotherport,"service":"","host":"host2","pwd":"","username":""}}
],
"enabled":false}
So far I've been tweaking here and there to have the results written as intended; however, each time I'm getting only the last row's values written.
i.e., putting targetobj as a variable inside targets:[]:
{"changed_time": 0, "year": 0, "description": 'ConvertedConfigFile', 'targets': [targetobj],
I know I'm missing something, I just need to find what and where.
Any help would be highly appreciated.
Thank you.
Just create your main_dict first and append to it in your loop, i.e.:
main_dict = {"changed_time": 0,
             "year": 0,
             "description": "blahblahblah",
             "targets": [],  # a new list for the target objects
             "enabled": False}

for val in rows:
    main_dict["targets"].append({  # append this dict to the targets list of main_dict
        "connection_props": {
            "port": 7800,
            "service": "",
            "host": val.Hostname,
            "pwd": "",
            "username": ""},
        "group_list": val.Groups,
        "cpu_core_cnt": 2,
        "target_name": "somename",
        "target_type": "somethingsamething",
        "os": val.OS,
        "rule_list": [],
        "user_list": val.Users
    })
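Once the loop has run, main_dict holds the full structure, and writing it out as a JSON file takes one more step with the standard json module (a minimal sketch; the filename is just a placeholder):
import json

# Sketch: dump the assembled dict to a file; the filename is a placeholder.
with open("converted_config.json", "w") as f:
    json.dump(main_dict, f, indent=2)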