ADF Data Flow flatten JSON to rows - json

In ADF Data Flow, how can I flatten JSON into rows rather than columns?
{
  "header": [
    {
      "main": {
        "id": 1
      },
      "sub": [
        {
          "type": "a",
          "id": 2
        },
        {
          "type": "b",
          "id": 3
        }
      ]
    }
  ]
}
In ADF I'm using the flatten transformation, which gives me main_id and sub_id as separate columns.
However, the result I'm trying to achieve merges the two id columns into one column, like below:

Since both main_id and sub_id belong to the same column, instead of using one flatten transformation to flatten all the data, flatten main and sub separately.
I have taken the following JSON as source for my dataflow.
{
  "header": [
    {
      "main": {
        "id": 1
      },
      "sub": [
        {
          "type": "a",
          "id": 2
        },
        {
          "type": "b",
          "id": 3
        }
      ]
    },
    {
      "main": {
        "id": 4
      },
      "sub": [
        {
          "type": "c",
          "id": 5
        },
        {
          "type": "d",
          "id": 6
        }
      ]
    }
  ]
}
I have used two flatten transformations, flattenMain and flattenSub, instead of one; both use the same source.
For flattenMain, I have unrolled by header and selected header as the unroll root. Then I created an additional column from the source column header.main.id.
The data preview for flattenMain would be:
For flattenSub, I have unrolled by header.sub and selected header.sub as the unroll root. Then I created two additional columns, selecting source column header.sub.id as the id column and header.sub.type as the type column.
The data preview for flattenSub transformation would be:
Now I have applied a union transformation to both flattenMain and flattenSub, using union by name.
The final data preview for this Union transformation will give the desired result.
NOTE: All the highlighted rows in output images indicate the result that would be achieved when we use the JSON sample provided in the question.
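If it helps to see the same idea outside ADF, here is a rough pandas sketch of the flatten-separately-then-union-by-name logic (not ADF Data Flow syntax; it assumes pandas is available and only illustrates the shape of the result):

import json
import pandas as pd

doc = json.loads("""
{"header": [
  {"main": {"id": 1}, "sub": [{"type": "a", "id": 2}, {"type": "b", "id": 3}]},
  {"main": {"id": 4}, "sub": [{"type": "c", "id": 5}, {"type": "d", "id": 6}]}
]}
""")

# flattenMain: one row per header entry, carrying main.id as id
flatten_main = pd.DataFrame({"id": [h["main"]["id"] for h in doc["header"]]})

# flattenSub: one row per sub entry, carrying sub.id as id and sub.type as type
flatten_sub = pd.DataFrame(
    [{"id": s["id"], "type": s["type"]} for h in doc["header"] for s in h["sub"]]
)

# Union by name, so main ids and sub ids land in the same id column
result = pd.concat([flatten_main, flatten_sub], ignore_index=True)
print(result)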

Related

Parse JSON object dynamically in Bigquery + dbt

I have a JSON message like the one below. I am using dbt with the BigQuery plugin, and I need to create a table dynamically in BigQuery.
{
  "data": {
    "schema": "dev",
    "payload": {
      "lastmodifieddate": "2022-11-122 00:01:28",
      "changeeventheader": {
        "changetype": "UPDATE",
        "changefields": [
          "lastmodifieddate",
          "product_value"
        ],
        "committimestamp": 18478596845860,
        "recordIds": [
          "568069"
        ]
      },
      "product_value": 20000
    }
  }
}
I need to create the table dynamically with recordIds and the changed fields. This field list changes dynamically whenever the source sends an update.
Expected output:
recordIds | product_value | lastmodifieddate     | changetype
568069    | 20000         | 2022-11-122 00:01:28 | UPDATE
Thanks for your suggestions and help!
JSON objects can be saved in a BigQuery table. There is no need to use dbt here.
with tbl as (
  select 5 as row, JSON '''{
    "data": {
      "schema": "dev",
      "payload": {
        "lastmodifieddate": "2022-11-122 00:01:28",
        "changeeventheader": {
          "changetype": "UPDATE",
          "changefields": [
            "lastmodifieddate",
            "product_value"
          ],
          "committimestamp": 18478596845860,
          "recordIds": [
            "568069"
          ]
        },
        "product_value": 20000
      }
    }
  }''' as JS
)
select *,
  JSON_EXTRACT_STRING_ARRAY(JS.data.payload.changeeventheader.recordIds) as recordIds,
  JSON_EXTRACT_SCALAR(JS.data.payload.product_value) as product_value,
  JSON_VALUE(JS.data.payload.lastmodifieddate) as lastmodifieddate,
  JSON_VALUE(JS.data.payload.changeeventheader.changetype) as changetype
from tbl
If the JSON is saved as a string in a BigQuery table, please use PARSE_JSON(column_name) to convert the string to JSON first.
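For reference, the same field paths can be checked outside BigQuery with plain Python; this only mirrors the extraction in the query above and is not part of the BigQuery solution:

import json

msg = json.loads("""
{
  "data": {
    "schema": "dev",
    "payload": {
      "lastmodifieddate": "2022-11-122 00:01:28",
      "changeeventheader": {
        "changetype": "UPDATE",
        "changefields": ["lastmodifieddate", "product_value"],
        "committimestamp": 18478596845860,
        "recordIds": ["568069"]
      },
      "product_value": 20000
    }
  }
}
""")

payload = msg["data"]["payload"]
header = payload["changeeventheader"]
# Same columns as the query: recordIds, product_value, lastmodifieddate, changetype
print(header["recordIds"], payload["product_value"],
      payload["lastmodifieddate"], header["changetype"])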

postgres json return an element conditionally

I have objects which contain an ID either in the element itself or within a JSON array inside another object.
For example:
{
  "activity": {
    "id": 12345
  }
}
Within an array:
{
  "activities": [
    {
      "entity_id": 23456
    },
    {
      "entity_id": 34567
    }
  ]
}
If I run the query select activity->'id', json_array_elements(activity->'activities')->'entity_id' from activities I only get back rows where the activities array exists and not rows that have the id in the object itself.
I can do something like the below, but it seems like there should be an easier way to do this?

Translate table of data to Smartsheet API cells JSON

I've got a process that reads a mapping configuration stored in Smartsheet. This is where the process admins can control the data flow. Ultimately this will be stored in a Snowflake table to be used by other Talend flows. I've brought it to this point, which includes getting updated names for mapped columns and sheets.
My objective now is to create JSON to add new rows to Smartsheet and also update existing rows. The only difference in the two calls is the inclusion of the row id in Smartsheet. For this example, I am focusing on new rows; I'm confident I can adapt any solution that addresses new rows to updating existing rows, as it only involves one more JSON attribute.
I'm having a little trouble wrapping my head around Smartsheet's unique way of storing rows and columns in JSON responses. Each row is a collection of cells, and each cell is a collection of attributes which includes the column id the cell belongs to.
Here is what I have at this point.
The data to be converted to JSON is in a table read from cache memory (tHashOutput and tHashInput, via several steps addressing other requirements):
LOADED_DATE_TIME_STR | SSHEET_NAME      | SSHEET_ID      | SSHEET_ROW_ID | SSHEET_COL_ID    | SSHEET_COL_NAME        | DB_TBL_NAME | DB_COL_NAME
20220221232059       | sheet_name_1_str | sheet_id_1_int | null          | xxxxxxxxxxxxxxx1 | sheet_1_col_name_a_str | null        | null
20220221232059       | sheet_name_2_str | sheet_id_2_int | null          | xxxxxxxxxxxxxxx2 | sheet_2_col_name_b_str | null        | null
20220221232059       | sheet_name_2_str | sheet_id_2_int | null          | xxxxxxxxxxxxxxx3 | sheet_2_col_name_c_str | null        | null
The mapping configuration sheet has 5 important columns whose mappings and ids I have not yet incorporated into the process. I will do this once I have an idea of where this part of the flow is headed:
LOADED_DATE_TIME_STR = col_id_1_int
SSHEET_NAME = col_id_2_int
SSHEET_ID = col_id_3_int
SSHEET_COL_ID = col_id_4_int
SSHEET_COL_NAME = col_id_6_int
Output JSON format (this will be a sub-element of a larger JSON tree). Specifically, each array of cells defines a row and will be a sub-element of the row within the Smartsheet API structure. A rough sketch of this mapping follows the example below.
{
  "cells": [
    {
      "columnId": col_id_1_int,
      "value": "20220221232059"
    },
    {
      "columnId": col_id_2_int,
      "value": "sheet_name_1_str"
    },
    {
      "columnId": col_id_3_int,
      "value": "sheet_id_1_int"
    },
    {
      "columnId": col_id_4_int,
      "value": "xxxxxxxxxxxxxxx1"
    },
    {
      "columnId": col_id_6_int,
      "value": "sheet_1_col_name_a_str"
    }
  ]
},
{
  "cells": [
    {
      "columnId": col_id_1_int,
      "value": "20220221232059"
    },
    {
      "columnId": col_id_2_int,
      "value": "sheet_name_2_str"
    },
    {
      "columnId": col_id_3_int,
      "value": "sheet_id_2_int"
    },
    {
      "columnId": col_id_4_int,
      "value": "xxxxxxxxxxxxxxx2"
    },
    {
      "columnId": col_id_6_int,
      "value": "sheet_2_col_name_b_str"
    }
  ]
},
...
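For clarity, here is a rough Python sketch of the mapping I'm after (not Talend; the integer column ids are placeholders standing in for col_id_1_int through col_id_6_int, and only the first cached row is shown):

import json

COLUMN_IDS = {                      # placeholder ids for col_id_1_int .. col_id_6_int
    "LOADED_DATE_TIME_STR": 101,
    "SSHEET_NAME": 102,
    "SSHEET_ID": 103,
    "SSHEET_COL_ID": 104,
    "SSHEET_COL_NAME": 106,
}

cached_rows = [                     # first row from the cached table above; the rest follow the same shape
    {"LOADED_DATE_TIME_STR": "20220221232059", "SSHEET_NAME": "sheet_name_1_str",
     "SSHEET_ID": "sheet_id_1_int", "SSHEET_ROW_ID": None, "SSHEET_COL_ID": "xxxxxxxxxxxxxxx1",
     "SSHEET_COL_NAME": "sheet_1_col_name_a_str", "DB_TBL_NAME": None, "DB_COL_NAME": None},
]

def to_cells(row):
    # One cell per mapped, non-null column; the Smartsheet row id would be added here for updates.
    return {"cells": [{"columnId": COLUMN_IDS[col], "value": val}
                      for col, val in row.items()
                      if col in COLUMN_IDS and val is not None]}

print(json.dumps([to_cells(r) for r in cached_rows], indent=2))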

jq - retrieve values from json table on one line for specific columns

I'm trying to get cell values from a JSON-formatted table, but only for specific columns, and have each row output as its own object.
json example -
{
  "rows": [
    {
      "id": 409363222161284,
      "rowNumber": 1,
      "cells": [
        {
          "columnId": "nameColumn",
          "value": "name1"
        },
        {
          "columnId": "infoColumn",
          "value": "info1"
        },
        {
          "columnId": "excessColumn",
          "value": "excess1"
        }
      ]
    },
    {
      "id": 11312541213,
      "rowNumber": 2,
      "cells": [
        {
          "columnId": "nameColumn",
          "value": "name2"
        },
        {
          "columnId": "infoColumn",
          "value": "info2"
        },
        {
          "columnId": "excessColumn",
          "value": "excess2"
        }
      ]
    },
    {
      "id": 11312541213,
      "rowNumber": 3,
      "cells": [
        {
          "columnId": "nameColumn",
          "value": "name3"
        },
        {
          "columnId": "infoColumn",
          "value": "info3"
        },
        {
          "columnId": "excessColumn",
          "value": "excess3"
        }
      ]
    }
  ]
}
Ideal output would be filtered by two columns - nameColumn, infoColumn - with each row being a single line of the values.
Output example -
{
"name": "name1",
"info": "info1"
}
{
"name": "name2",
"info": "info2"
}
{
"name": "name3",
"info": "info3"
}
I've tried quite a few different combinations of things with select statements, and this is the closest I've come, but it only handles one of the columns.
jq '.rows[].cells[] | {name: (select(.columnId=="nameColumn") .value), info: "infoHereHere"}'
{
"name": "name1",
"info": "infoHere"
}
{
"name": "name2",
"info": "infoHere"
}
{
"name": "name3",
"info": "infoHere"
}
If I try to combine another one, it's not so happy.
jq -j '.rows[].cells[] | {name: (select(.columnId=="nameColumn") .value), info: (select(.columnId=="infoColumn") .value)}'
Nothing is output.
** Edit **
Apologies for being unclear with this. The final output would ideally be a CSV of the selected columns' values:
name1,info1
name2,info2
Presumably you would want the output to be grouped by row, so let's first consider:
.rows[].cells
| map(select(.columnId=="nameColumn" or .columnId=="infoColumn"))
This produces a stream of JSON arrays; using your main example, the first of these would be:
[
  {
    "columnId": "nameColumn",
    "value": "name1"
  },
  {
    "columnId": "infoColumn",
    "value": "info1"
  }
]
If you want the output in some alternative format, then you could tweak the above jq program accordingly.
If you wanted to select a large number of columns, the use of a long "or" expression might become unwieldy, so you might also want to consider using a "whitelist". See e.g. Whitelisting objects using select
Or you might want to use del to delete the unwanted columns.
Producing CSV
One way would be to use @csv with the -r command-line option, e.g. with:
.rows[].cells
| map(select(.columnId=="nameColumn" or .columnId=="infoColumn")
  | {(.columnId): .value} )
| add
| [.nameColumn, .infoColumn]
| @csv
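If jq is not a hard requirement, the same row-wise selection can also be expressed in Python for comparison; this is only an illustration (the jq program above is the intended solution, and rows.json is a placeholder path):

import csv
import json
import sys

with open("rows.json") as f:        # placeholder path for the JSON shown in the question
    data = json.load(f)

writer = csv.writer(sys.stdout)
for row in data["rows"]:
    # Index the cells of each row by columnId, then pick only the wanted columns
    by_col = {cell["columnId"]: cell["value"] for cell in row["cells"]}
    writer.writerow([by_col.get("nameColumn"), by_col.get("infoColumn")])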

Pentaho Kettle: How to dynamically fetch JSON file columns

Background: I work for a company that basically sells passes. Every order placed by a customer will contain N number of passes.
Issue: I have these JSON event-transaction files coming into an S3 bucket on a daily basis from DocumentDB (MongoDB). Each JSON file is associated with the relevant type of event (insert, modify or delete) for every document key (which is an order in my case). The example below illustrates an "insert" type of event that came through to the S3 bucket:
{
  "_id": {
    "_data": "11111111111111"
  },
  "operationType": "insert",
  "clusterTime": {
    "$timestamp": {
      "t": 11111111,
      "i": 1
    }
  },
  "ns": {
    "db": "abc",
    "coll": "abc"
  },
  "documentKey": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    }
  },
  "fullDocument": {
    "_id": {
      "$uuid": "abcabcabcabcabcabc"
    },
    "orderNumber": "1234567",
    "externalOrderId": "12345678",
    "orderDateTime": "2020-09-11T08:06:26Z[UTC]",
    "attraction": "abc",
    "entryDate": {
      "$date": 2020-09-13
    },
    "entryTime": {
      "$date": 04000000
    },
    "requestId": "abc",
    "ticketUrl": "abc",
    "tickets": [
      {
        "passId": "1111111",
        "externalTicketId": "1234567"
      },
      {
        "passId": "222222222",
        "externalTicketId": "122442492"
      }
    ],
    "_class": "abc"
  }
}
As we see above, every JSON file might contain N number of passes, and every pass is in turn associated with an external ticket id, which is a different column (as seen above). I want to use Pentaho Kettle to read these JSON files and load the data into the DW.
I am aware of the JSON Input step and Row Normalizer that could then transpose "PassID 1", "PassID 2", "PassID 3"..."PassID N" columns into one unique column "Pass", and I would have to apply similar logic to the other column "External ticket id". The problem with that approach is that it is quite static: I need to "tell" Pentaho how many passes are coming in advance in the JSON Input step. However, what if tomorrow I have an order with 10 different passes? How can I do this dynamically to ensure the job will not break?
If you want a tabular output like
TicketUrl  Pass           ExternalTicketID
---------  -------------  ----------------
abc        PassID1Value1  ExTicketIDvalue1
abc        PassID1Value2  ExTicketIDvalue2
abc        PassID1Value3  ExTicketIDvalue3
and want the incoming values to be handled dynamically based on the JSON input file, then you can download this transformation: Updated Link.
I found everything works dynamically in the JSON Input step.
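For what it's worth, the dynamic part is just "one output row per element of fullDocument.tickets". Here is a small Python sketch of that flattening, purely to illustrate the logic (not Pentaho syntax; event.json is a placeholder file name and is assumed to contain valid JSON):

import json

with open("event.json") as f:       # placeholder path for one S3 event file
    event = json.load(f)

doc = event["fullDocument"]
rows = [
    {
        "ticketUrl": doc["ticketUrl"],
        "pass": ticket["passId"],
        "externalTicketId": ticket["externalTicketId"],
    }
    for ticket in doc["tickets"]    # works for 2 passes or 10, no change needed
]
for row in rows:
    print(row)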