Reshape a nested JSON file with jq and make a CSV - json

I've been struggling with this one for the whole day: a JSON file which I want to turn into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the JSON to contain just 2 objects inside "items".
What I would like to get is a CSV like this:
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are 2 extra complications: there are 2 types of "officers", some are people and some are companies, so not all keys present in one type are present in the other and vice versa. I'd like the missing entries to be 'null'. The second complication is the nested objects, like "name", which contains a comma in it! Or "address", which contains several sub-objects (which I guess I could flatten in pandas, though).
{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}
What I've been doing is creating new JSON objects, extracting the fields I need, like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours, and I am sure there is a quicker way.
Right now I am trying a new approach: creating a JSON whose root is the company number, holding the list of its officers.
{(.links.self | split("/")[2]): .items[]}

Using jq, it's easiest to first extract the shared values from the top-level object, then generate the desired rows. You'll want to go through the items at most once.
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| @csv
' input.json
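With the truncated input above, this should produce something like the following (with @csv, string values are quoted and null fields become empty):
"OC418979","England","llp-designated-member","2017-09-15"
"OC418979",,"corporate-llp-designated-member","2017-09-15"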

OK, you want to scan the list of officers, extract some fields from each one if they are present, and write them out in CSV format.
The first part is to extract the data from the JSON. Assuming you loaded it into a Python object named data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
      data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')
                         ))
The get method on a dictionary lets you supply a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
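For reference, here is a minimal end-to-end sketch combining the loading and writing steps, assuming the truncated JSON is saved as officers.json (a hypothetical filename):
import csv
import json

# Load the API response from disk (hypothetical filename).
with open('officers.json') as fd:
    data = json.load(fd)

# Write one CSV row per officer; missing keys default to empty strings.
with open('output.csv', 'w', newline='') as out:
    wr = csv.writer(out)
    for officer in data['items']:
        wr.writerow(('OC418979',
                     officer.get('country_of_residence', ''),
                     officer.get('officer_role', ''),
                     officer.get('appointed_on', '')))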

Related

Generate a separate CSV record for each array element

I have a JSON:
{
  "Country": "USA",
  "State": "TX",
  "Employees": [
    {
      "Name": "Name1",
      "address": "SomeAdress1"
    }
  ]
}
{
  "Country": "USA",
  "State": "FL",
  "Employees": [
    {
      "Name": "Name2",
      "address": "SomeAdress2"
    },
    {
      "Name": "Name3",
      "address": "SomeAdress3"
    }
  ]
}
{
  "Country": "USA",
  "State": "CA",
  "Employees": [
    {
      "Name": "Name4",
      "address": "SomeAdress4"
    }
  ]
}
I want to use jq to get the following result in csv format:
Country, State, Name, Address
USA, TX, Name1, SomeAdress1
USA, FL, Name2, SomeAdress2
USA, FL, Name3, SomeAdress3
USA, CA, Name4, SomeAdress4
I have got the following jq:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | @csv'
And I get the following, with the 2nd line having more columns than required. I want those extra columns in a separate row:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
And I want the following result:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2"
"USA","FL","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
You need to generate a separate array for each employee.
[.Country, .State] + (.Employees[] | [.Name, .address]) | @csv
You can store root object in a variable, and then expand the Employees arrays:
$ jq -r '. as $root | .Employees[]|[$root.Country, $root.State, .Name, .address] | @csv'
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2"
"USA","FL","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
The other answers are good, but I want to talk about why your attempt doesn't work, as well as why it seems like it should.
You are wondering why this:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | #csv'
produces this:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
perhaps because this:
jq '{Country:.Country,State:.State,Name:(.Employees[]|.Name)}'
produces this:
{
  "Country": "USA",
  "State": "TX",
  "Name": "Name1"
}
{
  "Country": "USA",
  "State": "FL",
  "Name": "Name2"
}
{
  "Country": "USA",
  "State": "FL",
  "Name": "Name3"
}
{
  "Country": "USA",
  "State": "CA",
  "Name": "Name4"
}
It turns out the difference is in what exactly [...] and {...} do in a jq filter. In the array constructor [...], the entire contents of the square brackets, commas and all, is a single filter, which is fully evaluated and all the results combined into one array. Each comma inside is simply the sequencing operator, which means: generate all the values from the filter on its left, then all the values from the filter on its right. In contrast, the commas in the {...} object constructor are part of the syntax and just separate the fields of the object. If any of the field expressions yield multiple values, then multiple whole objects are produced. If multiple field expressions yield multiple values, then you get a whole object for every combination of yielded values.
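You can see the difference with a pair of tiny filters (a minimal illustration, not from the original question):
$ jq -nc '[1, 2]'
[1,2]
$ jq -nc '{a: (1, 2)}'
{"a":1}
{"a":2}
The array constructor collects both values into one array, while the object constructor emits one whole object per value.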
When you do this:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | @csv'
                  ^      ^                   ^
                  1      2                   3
the problem is that the commas labelled "1", "2" and "3" are all doing the same thing, evaluating all the values for the filter on the left, then all the values for the filter on the right. Then the array constructor catches all of them and produces a single array. The array constructor will never create more than one array for one input.
So with that in mind, you need to make sure that the place where you expand out .Employees[] isn't inside your array constructor. Here's another option to add to the answers you already have:
jq -r '.Employee=.Employees[]|[.Country,.State,.Employee.Name,.Employee.address]|@csv'
or indeed:
jq -r '.Employees[] as $e|[.Country,.State,$e.Name,$e.address]|@csv'

Extract data from a JSON file using python

Say I have a JSON entry as follows (the JSON file was generated by fetching data from a Firebase DB):
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
How do I extract the content corresponding to 'value', which is present inside the column 'bills'? Any way to do this?
My Python code is as follows. With this I was only able to get the data within the bills column, but I need only the entry corresponding to 'value' inside bills.
import json
filedata = open('firebase-dataset.json','r')
data = json.load(filedata)
listoffields = [] # To produce it into a list with fields
for dic in data:
    try:
        listoffields.append(dic['bills']) # only non-essential bill categories.
    except KeyError:
        pass
print(listoffields)
The JSON you posted contains misplaced quotes.
I think you are trying to extract the value of the 'value' column within bills.
Try this:
print(listoffields[0][0]['value'])
which will print 100.0 as a str; use float() to convert it for calculations.
---edit---
Say the JSON you have contains many JSON objects separated by commas, as in:
[{ first-entry },{ second-entry },{ third.. }, ....and so on]
...and you want to find the value of each bill in each JSON object.
Maybe the code below will work:
bill_value_list = [] # to store 'value' of each bill
for bill_list in listoffields:
    bill_value_list.append(float(bill_list[0]['value'])) # bill_list[0] contains the complete bill dictionary.
print(bill_value_list)
print(sum(bill_value_list)) # do something useful
Paste it after the code you posted (no changes to your code, since it already works :-)).
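Alternatively, a compact sketch that collects the 'value' of every bill across all records, assuming each bill dictionary has a 'value' key:
# Flatten all bills from all records and convert each 'value' to float.
bill_values = [float(bill['value'])
               for record in data
               for bill in record.get('bills', [])]
print(sum(bill_values))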

How to parse nested json and write in Redshift?

I have the following JSON structure:
{
  "firstname": "A",
  "lastname": "B",
  "age": 24,
  "address": {
    "streetAddress": "123",
    "city": "San Jose",
    "state": "CA",
    "postalCode": "394221"
  },
  "phonenumbers": [
    { "type": "home", "number": "123456789" },
    { "type": "mobile", "number": "987654321" }
  ]
}
I need to copy this JSON from S3 to a Redshift table.
I am currently using the COPY command with a JSONPaths file, but it loads the array as a single column.
I want the nested array to be parsed so that the table looks like this:
firstname|lastname|age|streetaddress|city    |state|postalcode|type  |number
-----------------------------------------------------------------------------
A        |B       |24 |123          |San Jose|CA   |394221    |home  |123456789
A        |B       |24 |123          |San Jose|CA   |394221    |mobile|987654321
Is there a way to do that?
You can use nested JSON paths by making use of JSONPaths files. However, this does not work with the multiple phone number types.
If you can modify the dataset to have multiple records (one for mobile, one for home) then your file would look similar to the below.
{
  "jsonpaths": [
    "$.firstname",
    "$.lastname",
    "$.age",
    "$.address.streetAddress",
    "$.address.city",
    "$.address.state",
    "$.address.postalCode",
    "$.phonenumbers[0].type",
    "$.phonenumbers[0].number"
  ]
}
If you are unable to change the format, you will need to perform an ETL step at load time before the data can be consumed by Redshift. For this you could use an S3 object-creation event to trigger a Lambda function that performs the ETL before the data is loaded into Redshift.
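As a rough sketch of the flattening such a Lambda function could perform (file names are hypothetical; a real function would read the object from S3 and write the flattened output back):
import json

def flatten(record):
    # One flat dict per phone number, i.e. one Redshift row each.
    base = {
        "firstname": record["firstname"],
        "lastname": record["lastname"],
        "age": record["age"],
        "streetaddress": record["address"]["streetAddress"],
        "city": record["address"]["city"],
        "state": record["address"]["state"],
        "postalcode": record["address"]["postalCode"],
    }
    for phone in record["phonenumbers"]:
        yield {**base, "type": phone["type"], "number": phone["number"]}

# Emit newline-delimited JSON, which a plain COPY can then ingest.
with open("input.json") as fd:  # hypothetical input file
    record = json.load(fd)
for row in flatten(record):
    print(json.dumps(row))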

Parse JSON to CSV using jq but split sub list in multiple records

I am parsing a JSON result which I get from the Azure RateAPI and want to convert it into a standard CSV file.
The following line is what I am using to convert it to CSV, and it works, but since one of the attributes is a list, it does not give me the result I am seeking. For every item in the "sub list", I need to create another record in my CSV file.
cat myfile.json | jq -r '.Meters[] | [ .EffectiveDate, .IncludedQuantity, .MeterCategory, .MeterId, .MeterName, .MeterRates[], .MeterRegion, .MeterStatus, .MeterSubCategory, .MeterTags[], .Units] | @csv'
Here are 3 records I am trying to parse. I am having trouble with record 2 because MeterRates is actually the list where I need both the attribute and the value. I would need record 2, once parsed, to correspond to 3 records in the CSV file, where each record contains one item of the list in MeterRates. An example of the expected result is at the end.
"OfferTerms": [],
"Meters": [
{
"EffectiveDate": "2019-03-01T00:00:00Z",
"IncludedQuantity": 0,
"MeterCategory": "Virtual Machines",
"MeterId": "d0bf9053-17c4-4fec-8502-4eb8376343a7",
"MeterName": "F2/F2s Low Priority",
"MeterRates": {
"0": 0.0766
},
"MeterRegion": "US West 2",
"MeterStatus": "Active",
"MeterSubCategory": "F/FS Series Windows",
"MeterTags": [],
"Unit": "1 Hour"
},
{
"EffectiveDate": "2014-11-01T00:00:00Z",
"IncludedQuantity": 0,
"MeterCategory": "Azure DevOps",
"MeterId": "c4d6fa88-0df9-4680-867a-b13c960a875f",
"MeterName": "Virtual User Minute",
"MeterRates": {
"0": 0.0004,
"1980000": 0.0002,
"9980000": 0.0001
},
"MeterRegion": "",
"MeterStatus": "Active",
"MeterSubCategory": "Cloud-Based Load Testing",
"MeterTags": [],
"Unit": "1/Month"
},
{
"EffectiveDate": "2017-04-01T00:00:00Z",
"IncludedQuantity": 0,
"MeterCategory": "SQL Database",
"MeterId": "cb770eab-d5c8-45fd-ac56-8c35069f5a29",
"MeterName": "P4 DTUs",
"MeterRates": {
"0": 68.64
},
"MeterRegion": "IN West",
"MeterStatus": "Active",
"MeterSubCategory": "Single Premium",
"MeterTags": [],
"Unit": "1/Day"
}
]
}
The actual result, using the code I provided, is the following:
"2019-03-01T00:00:00Z",0,"Virtual Machines","d0bf9053-17c4-4fec-8502-4eb8376343a7","F2/F2s Low Priority",0.0766,"US West 2","Active","F/FS Series Windows",
"2014-11-01T00:00:00Z",0,"Azure DevOps","c4d6fa88-0df9-4680-867a-b13c960a875f","Virtual User Minute",0.0004,0.0002,0.0001,"","Active","Cloud-Based Load Testing",
"2017-04-01T00:00:00Z",0,"SQL Database","cb770eab-d5c8-45fd-ac56-8c35069f5a29","P4 DTUs",68.64,"IN West","Active","Single Premium",
but the result I would expect is (record 2 corresponds to 3 records in the CSV file, based on MeterRates):
"2019-03-01T00:00:00Z",0,"Virtual Machines","d0bf9053-17c4-4fec-8502-4eb8376343a7","F2/F2s Low Priority",0,0.0766,"US West 2","Active","F/FS Series Windows",
"2014-11-01T00:00:00Z",0,"Azure DevOps","c4d6fa88-0df9-4680-867a-b13c960a875f","Virtual User Minute",0,0.0004,"","Active","Cloud-Based Load Testing",
"2014-11-01T00:00:00Z",0,"Azure DevOps","c4d6fa88-0df9-4680-867a-b13c960a875f","Virtual User Minute",1980000,0.0002,"","Active","Cloud-Based Load Testing",
"2014-11-01T00:00:00Z",0,"Azure DevOps","c4d6fa88-0df9-4680-867a-b13c960a875f","Virtual User Minute",9980000,0.0001"","Active","Cloud-Based Load Testing",
"2017-04-01T00:00:00Z",0,"SQL Database","cb770eab-d5c8-45fd-ac56-8c35069f5a29","P4 DTUs",0,68.64,"IN West","Active","Single Premium",
Thank you for your help.
You'll want to add a step between getting the Meters items and outputting the rows, to generate the various combinations of Meters items with different rates. As you have it right now, you're outputting the rates as additional columns of the same row, which isn't really what you want.
In this case, you could just add a new property to hold the value of the corresponding MeterRate.
.Meters[] | .MeterRate = (.MeterRates | to_entries[])
| [.EffectiveDate, .IncludedQuantity, .MeterCategory, .MeterId, .MeterName,
   .MeterRate.key, .MeterRate.value,
   .MeterRegion, .MeterStatus, .MeterSubCategory, .MeterTags[], .Units]
| @csv
You may want to consider doing something similar for the MeterTags items so you don't end up with potentially random column counts.
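For reference, to_entries is what splits the rates object into one {key, value} pair per rate, and the assignment then fans those out into separate rows (a jq assignment produces one output per value of its right-hand side). A small illustration:
$ jq -c '{"0": 0.0004, "1980000": 0.0002} | to_entries[]'
{"key":"0","value":0.0004}
{"key":"1980000","value":0.0002}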

PySpark - Getting list of dicts and converting its keys/values to columns

I have the following JSON (located in my local file system at path_json):
[
  {
    "name": "John",
    "email": "john@hisemail.com",
    "gender": "Male",
    "dict_of_columns": [
      {
        "column_name": "hobbie",
        "columns_value": "guitar"
      },
      {
        "column_name": "book",
        "columns_value": "1984"
      }
    ]
  },
  {
    "name": "Mary",
    "email": "mary@heremail.com",
    "gender": "Female",
    "dict_of_columns": [
      {
        "column_name": "language",
        "columns_value": "Python"
      },
      {
        "column_name": "job",
        "columns_value": "analyst"
      }
    ]
  }
]
As you can see, this is a nested json.
I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
Ok. Now, it produces me the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns                     |email              |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]]    |john@hisemail.com  |Male  |John|
|[[language, Python], [job, analyst]]|mary@heremail.com  |Female|Mary|
+------------------------------------+-------------------+------+----+
I want to know if there is a way to produce the following dataframe:
+----+-----------------+------+------+-------+--------+----+
|book|email            |gender|hobbie|job    |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john@hisemail.com|Male  |guitar|null   |null    |John|
|null|mary@heremail.com|Female|null  |analyst|Python  |Mary|
+----+-----------------+------+------+-------+--------+----+
A few comments:
My real data has thousands and thousands of lines
I don't know all the column_name values in my dataset (there are many of them)
email is unique for each line, so it can be used as a key if a join is necessary. I tried this approach before: create a main dataframe with columns [name, gender, email] and other dataframes for each row containing the dictionaries. But without success (and it doesn't have good performance).
Thank you so much!
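One possible approach (a sketch, not a tested answer) is to explode dict_of_columns and then pivot on column_name; Spark discovers the distinct column_name values for you:
from pyspark.sql import functions as F

df = spark.read.option("multiline", "true").json(path_json)

# One row per (person, key/value pair) after exploding the array of structs.
exploded = df.select(
    "name", "email", "gender",
    F.explode("dict_of_columns").alias("kv"),
).select(
    "name", "email", "gender",
    F.col("kv.column_name").alias("k"),
    F.col("kv.columns_value").alias("v"),
)

# Pivot the keys into real columns; first() picks the single value per key.
result = exploded.groupBy("name", "email", "gender").pivot("k").agg(F.first("v"))
result.show()
If the set of distinct keys is known in advance, passing it explicitly as pivot("k", values) saves Spark an extra pass over the data to discover them.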