How to parse nested json and write in Redshift? - json

I have a following json structure like this:
{
"firstname": "A",
"lastname": "B",
"age": 24,
"address": {
"streetAddress": "123",
"city": "San Jone",
"state": "CA",
"postalCode": "394221"
},
"phonenumbers": [
{ "type": "home", "number": "123456789" }
{ "type": "mobile", "number": "987654321" }
]
}
I need to copy this json from S3 to a Redshift table.
I am currently using copy command with a path file but it loads array as a single column.
I wanted the nested array to be parsed and the table should like this:
firstname|lastname|age|streetaddress|city |state|postalcode|type|number
-----------------------------------------------------------------------------
A | B |24 |123 |SanJose|CA |394221 |home|123456789
-----------------------------------------------------------------------------
A | B |24 |123 |SanJose|CA |394221 |mob|987654321
Is there a way to do that?

You can do use nested JSON paths by making use of JSON path files. However, this does not work with the multiple phone number types.
If you can modify the dataset to have multiple records (one for mobile, one for home) then your file would look similar to the below.
{
"jsonpaths": [
"$.firstname",
"$.lastname",
"$.venuestate",
"$.age",
"$.address.streetAddress",
"$.address.city",
"$.address.state",
"$.address.postalCode",
"$.phonenumbers[0].type",
"$.phonenumbers[0].number"
]
}
If you are unable to change the format you will need to perform an ETL task upon load before it can be consumed by Redshift. For this you could use an event for creation of objects to trigger a Lambda function and then perform the ETL process for you before it loads into Redshift.

Related

Add data to a json file using Talend

I have the following JSON:
[
{
"date": "29/11/2021",
"Name": "jack",
},
{
"date": "30/11/2021",
"Name": "Adam",
},
"date": "27/11/2021",
"Name": "james",
}
]
Using Talend, I wanna add 2 lines to have something like:
[
{
"company": "AMA",
"service": "BI",
"date": "29/11/2021",
"Name": "jack",
},
{
"company": "AMA",
"service": "BI",
"date": "30/11/2021",
"Name": "Adam",
},
"company": "AMA",
"service": "BI",
"date": "27/11/2021",
"Name": "james",
}
]
Currently, I use 3 components (tJSONDocOpen, tFixedFlowInput, tJSONDocOutput) but I can't have the right configuration of components in order to get the job done !
If you are not comfortable with json .
Just do these steps :
In the metaData just create a FileJson like this then paste it in your job as a tFileInputJson
Your job design and mapping would be
In your tFileOutputJson don't forget to change in the name of the data block "Data" with ""
What you need to do there according to the Talend practices is read your JSON. Then extract each object of it, add your properties and finally rebuild your JSON in a file.
An efficient way to do this is using tMap componenent like this.
The first tFileInputJSON will have to specify what properties it has to read from the JSON by setting your 2 objects in the mapping field.
Then the tMap will simply add 2 columns to your main stream, here is an example with hard coded string values. Depending on you needs, this component will also offer you the possibility to assign dynamic data to your 2 new columns, it's a powerful tool for manipulating the structure of a data stream.
You will find more infos about this component in the official documentation : https://help.talend.com/r/en-US/7.3/tmap/tmap; especially the "tMap scenarios" part.
Note
Instead of using the tMap, if you are comfortable with Java, you can use a tjavaRow instead. Using this, you can setup your 2 new columns with whatever java code you want to put as long as you have defined the output schema of the component.
output_row.Name = input_row.Name;
output_row.date = input_row.date;
output_row.company = "AMA";
output_row.service = "BI";

Converting JSON Table Array Data in Azure Data Factory from Log Analytics REST API JSON Response to Multiple JSON Documents in same file

I'm currently trying to extract data out of Log Analytics through its REST API. I have been successful at using a Copy Data activity to store the response in an Azure Data Lake Gen 2 account.
The format is roughly similar to the example from the Log Analytics API Reference Page.
{
"tables": [
{
"name": "PrimaryResult",
"columns": [
{
"name": "Category",
"type": "string"
},
{
"name": "count_",
"type": "long"
}
],
"rows": [
[
"Administrative",
20839
],
[
"Recommendation",
122
],
[
"Alert",
64
],
[
"ServiceHealth",
11
]
]
}
] }
My dataset is much larger with more columns more values etc but the principals are the same.
What I am trying to do is generate a new JSON file that would hold the table but multiple documents in the same file e.g.
[{
"Category": "Administrative",
"count_": 20839
},
{
"Category": "Recommendation",
"count_": 122
},
{
"Category": "Alert",
"count_": 64
},
{
"Category": "ServiceHealth",
"count_": 11
}]
The output of this would be stored back into the data lake and then ideally could be used as a source for a copy activity to go into an Azure SQL Database.
I have tried accomplishing this using Data Flows Flattening but haven't been successful with this up until this point as when trying to map the column name it doesn't see individual column names just that level of the document where the column names are defined.
How would I go about flattening the dataset so it appears as desired? Is this an unrealistic expectation of Data flows or is this task more suitable for something like Azure Databricks?

Pentaho Kettle: How to dynamically fetch JSON file columns

Background: I work for a company that basically sells passes. Every order that is placed by the customer will contain N number of passes.
Issue: I have these JSON event-transaction files coming into a S3 bucket on a daily basis from DocumentDB (MongoDB). This JSON file is associated to the relevant type of event (insert, modify or delete) for every document key (which is an order in my case). The example below illustrates a "Insert" type of event that came through to the S3 bucket:
{
"_id": {
"_data": "11111111111111"
},
"operationType": "insert",
"clusterTime": {
"$timestamp": {
"t": 11111111,
"i": 1
}
},
"ns": {
"db": "abc",
"coll": "abc"
},
"documentKey": {
"_id": {
"$uuid": "abcabcabcabcabcabc"
}
},
"fullDocument": {
"_id": {
"$uuid": "abcabcabcabcabcabc"
},
"orderNumber": "1234567",
"externalOrderId": "12345678",
"orderDateTime": "2020-09-11T08:06:26Z[UTC]",
"attraction": "abc",
"entryDate": {
"$date": 2020-09-13
},
"entryTime": {
"$date": 04000000
},
"requestId": "abc",
"ticketUrl": "abc",
"tickets": [
{
"passId": "1111111",
"externalTicketId": "1234567"
},
{
"passId": "222222222",
"externalTicketId": "122442492"
}
],
"_class": "abc"
}
}
As we see above, every JSON file might contain N number of passes and every pass is - in turn - is associated to an external ticket id, which is a different column (as seen above). I want to use Pentaho Kettle to read these JSON files and load the data into the DW. I am aware of the Json input step and Row Normalizer that could then transpose "PassID 1", "PassID 2", "PassID 3"..."PassID N" columns into 1 unique column "Pass" and I would have to have to apply a similar logic to the other column "External ticket id". The problem with that approach is that it is quite static, as in, I need to "tell" Pentaho how many Passes are coming in advance in the Json input step. However what if tomorrow I have an order with 10 different passes? How can I do this dynamically to ensure the job will not break?
If you want a tabular output like
TicketUrl Pass ExternalTicketID
---------- ------ ----------------
abc PassID1Value1 ExTicketIDvalue1
abc PassID1Value2 ExTicketIDvalue2
abc PassID1Value3 ExTicketIDvalue3
And make incoming value dynamic based on JSON input file values, then you can download this transformation Updated Link
I found everything work dynamic in JSON input.

PySpark - Getting list of dicts and converting its keys/values to columns

I have the following json (located in my local file system in path_json):
[
{
"name": "John",
"email": "john#hisemail.com",
"gender": "Male",
"dict_of_columns": [
{
"column_name": "hobbie",
"columns_value": "guitar"
},
{
"column_name": "book",
"columns_value": "1984"
}
]
},
{
"name": "Mary",
"email": "mary#heremail.com",
"gender": "Female",
"dict_of_columns": [
{
"column_name": "language",
"columns_value": "Python"
},
{
"column_name": "job",
"columns_value": "analyst"
}
]
}
]
As you can see, this is a nested json.
I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
Ok. Now, it produces me the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns |email |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]] |john#hisemail.com |Male |John|
|[[language, Python], [job, analyst]]|mary#heremail.com |Female|Mary|
+------------------------------------+-------------------+------+----+
I want to know if there is a way to produces the following dataframe:
+----+-----------------+------+------+-------+--------+----+
|book|email |gender|hobbie|job |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john#hisemail.com|Male |guitar|null |null |John|
|null|mary#heremail.com|Female|null |analyst|Python |Mary|
+----+-----------------+------+------+-------+--------+----+
A few comments:
My real data has thousands and thousands of lines
I don't know all the column_name in my dataset (there are many of them)
email is unique for each line, so it can be used as key if a join is necessary. I tried this approach before: create a main dataframe with columns [name,gender,email] and other dataframes for each row containing the dictionaries. But without success (and it doesn`t have good performance).
Thanks you so much!

reshape jq nested file and make csv

I've been struggling with this one for the whole day which i want to turn to a csv.
It represents the officers attached to company whose number is "OC418979" in the UK Company House API.
I've already truncated the json to contain just 2 objects inside "items".
What I would like to get is a csv like this
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are 2 extra complication: there are 2 types of "officers", some are people, some are companies, so not all key in people are present in the other and viceversa. I'd like these entries to be 'null'. Second complication is those nested objects like "name" which contains a comma in it! or address, which contains several sub-objects (which I guess I could flatten in pandas tho).
{
"total_results": 13,
"resigned_count": 9,
"links": {
"self": "/company/OC418979/officers"
},
"items_per_page": 35,
"etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
"kind": "officer-list",
"active_count": 4,
"inactive_count": 0,
"start_index": 0,
"items": [
{
"officer_role": "llp-designated-member",
"name": "BARRICK, David James",
"date_of_birth": {
"year": 1984,
"month": 1
},
"appointed_on": "2017-09-15",
"country_of_residence": "England",
"address": {
"country": "United Kingdom",
"address_line_1": "Old Gloucester Street",
"locality": "London",
"premises": "27",
"postal_code": "WC1N 3AX"
},
"links": {
"officer": {
"appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
}
}
},
{
"links": {
"officer": {
"appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
}
},
"address": {
"locality": "Tadcaster",
"country": "United Kingdom",
"address_line_1": "Westgate",
"postal_code": "LS24 9AB",
"premises": "5a"
},
"identification": {
"legal_authority": "UK",
"identification_type": "non-eea",
"legal_form": "UK"
},
"name": "PREMIER DRIVER LIMITED",
"officer_role": "corporate-llp-designated-member",
"appointed_on": "2017-09-15"
}
]
}
What I've been doing is creating new json objects extracting the fields I needed like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours - and I am sure there is a quicker way.
Right now I am trying this new approach - creating a json whose root is the company number and as argument a list of its officers.
{(.links.self | split("/")[2]): .items[]}
Using jq, it's easier to extract values from the top-level object that will be shared and generate the desired rows. You'll want to limit the amounts of times you go through the items to at most once.
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| #csv
' input.json
Ok, you want to scan the list of officers, extract some fields from there if they are present and write that in csv format.
First part is to extract the data from the json. Assuming you loaded it is a data Python object, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
wr = csv.writer(fd)
for officer in data['items']:
_ = wr.writerow(('OC418979',
officer.get('country_of_residence',''),
officer.get('officer_role', ''),
officer.get('appointed_on', '')
))
The get method on a dictionnary allows to use a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15