Using the recursive feature while flattening JSON in Snowflake

I have a JSON string which needs to be parsed in order to retrieve particular values. Here is an example I am working with:
{
  "assignable_type": "SHIPMENT",
  "rule": {
    "rules": [
      {
        "meta_data": {},
        "rules": [
          {
            "op": "IN",
            "target": "CLIENT_FID",
            "type": "ARRAY_VALUE_ASSERTION",
            "values": [
              "flx::core:client:dbid/64171",
              "flx::core:client:dbid/76049",
              "flx::core:client:dbid/34040",
              "flx::core:client:dbid/61806"
            ]
          }
        ],
        "type": "AND"
      }
    ],
    "type": "OR"
  },
  "type": "USER_DEFINED"
}
The goal is to get the "values" wherever "target" is "CLIENT_FID".
The expected output for this JSON should be:
["flx::core:client:dbid/64171",
"flx::core:client:dbid/76049",
"flx::core:client:dbid/34040",
"flx::core:client:dbid/61806"]
Here, as we can see, rules is a list of dictionaries, and the lists can be nested, as seen in the example.
Similarly, we have another JSON file of the following type:
{
  "assignable_type": "SHIPMENT",
  "rule": {
    "rules": [
      {
        "meta_data": {},
        "rules": [
          {
            "op": "IN",
            "target": "PORT_OF_ENTRY_FID",
            "type": "ARRAY_VALUE_ASSERTION",
            "values": [
              "flx::core:port:dbid/566788",
              "flx::core:port:dbid/566931",
              "flx::core:port:dbid/561482"
            ]
          }
        ],
        "type": "AND"
      },
      {
        "meta_data": {},
        "rules": [
          {
            "op": "IN",
            "target": "PORT_OF_LOADING_FID",
            "type": "ARRAY_VALUE_ASSERTION",
            "values": [
              "flx::core:port:dbid/561465"
            ]
          },
          {
            "op": "IN",
            "target": "SHIPMENT_MODE",
            "type": "ARRAY_VALUE_ASSERTION",
            "values": [
              0
            ]
          },
          {
            "op": "IN",
            "target": "CLIENT_FID",
            "type": "ARRAY_VALUE_ASSERTION",
            "values": [
              "flx::core:client:dbid/28169"
            ]
          }
        ],
        "type": "AND"
      }
    ],
    "type": "OR"
  },
  "type": "USER_DEFINED"
}
For the second example, the expected output should be:
["flx::core:client:dbid/28169"]
As seen, we may need to read the values at different depths in the file. To address this, I used the following code:
/* cte1: serialize the input, then pull out rule.rules as a VARIANT */
with cte1 as (
    select to_json(json_string) as json_rep,
           parse_json(json_extract_path_text(json_rep, 'rule.rules')) as list_elem
    from table1),
cte2 as (
    select split_array,
           json_extract_path_text(split_array, 'target') as target_client
    from (
        select json_rep,
               list_elem,
               t.value as split_array,
               typeof(split_array) as obj_type,
               index
        from cte1,
             table(flatten(cte1.list_elem, recursive=>true)) as t) temp /* use the recursive feature */
    where split_array ilike '%"target":"client_fid"%' /* filter for rows containing this string */
      and obj_type='OBJECT')
select
    split_array,
    json_extract_path_text(split_array, 'values') as client_values
from cte2
where target_client='CLIENT_FID'; /* filter the rows whose dictionary contains the client fid */
To address the varying depth at which CLIENT_FID can be found, we recurse while flattening the string into rows. The output obtained for both of the above inputs is provided below.
For the first string, we get the actual output in the variable client_values as
["flx::core:client:dbid/64171",
"flx::core:client:dbid/76049",
"flx::core:client:dbid/34040",
"flx::core:client:dbid/61806"]
Similarly, for the second string we get the actual output as
["flx::core:client:dbid/28169"]
As seen, the code appears to produce the correct output, but filtering on target_client='CLIENT_FID' in the final query seems like a very hacky approach. Is there a better way to retrieve the CLIENT_FID values even though their depth in the input can vary?
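For illustration, the kind of simplification I am after might look like the following sketch (assuming the same table1 and json_string names as above, with json_string parseable as JSON), which pushes the predicate onto the flattened node itself so the string match and the final filter collapse into one condition:
/* parse once, recursively flatten rule.rules, then keep the OBJECT nodes whose target is CLIENT_FID */
with src as (
    select parse_json(json_string) as v
    from table1)
select f.value:"values" as client_values
from src,
     lateral flatten(input => src.v:rule.rules, recursive => true) f
where typeof(f.value) = 'OBJECT'
  and f.value:target::string = 'CLIENT_FID';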
Help is appreciated.

Related

Extract value of Tags from CloudTrail logs using Athena

I am trying to query CloudTrail logs using Athena. My goal is to find specific instances and extract them with their Tags.
The query I am using is:
SELECT eventTime,
       awsRegion,
       json_extract(responseelements, '$.instancesSet.items[0].instanceId') AS instanceId,
       json_extract(responseelements, '$.instancesSet.items[0].tagSet.items') AS TAGS
FROM cloudtrail_logs_PP
WHERE (eventName = 'RunInstances' OR eventName = 'StartInstances')
  AND requestparameters LIKE '%mytest1%'
  AND "timestamp" BETWEEN '2021/09/01' AND '2021/10/01'
ORDER BY eventTime;
Using this query - I am able to get all Tags under one column.
I want to extract only specific Tags and need help with that. How can I extract just one specific Tag?
I tried enhancing my query with json_extract(responseelements, '$.instancesSet.items[0].tagSet.items[0]'), but the order of Tags differs from log to log, so I can't rely on a fixed index position.
My JSON file in S3 looks something like this:
{
  "eventVersion": "1",
  "eventTime": "2022-05-27T18:44:29Z",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "requestParameters": {
    "instancesSet": {
      "items": [{
        "imageId": "ami-1234545",
        "keyName": "DDKJKD"
      }]
    },
    "instanceType": "m5.2xlarge",
    "monitoring": {
      "enabled": false
    },
    "hibernationOptions": {
      "configured": false
    }
  },
  "responseElements": {
    "instancesSet": {
      "items": [{
        "tagSet": {
          "items": [{
            "key": "11",
            "value": "DS"
          }, {
            "key": "1",
            "value": "A"
          }]
        }
      }]
    }
  }
}
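One way to make the lookup independent of the tag order is to unnest the tagSet items so each tag becomes its own row, then filter on the tag's key rather than its position. A hedged Presto/Athena sketch against the same table, where the tag key 'Name' is only a stand-in for whichever Tag is actually needed:
-- cast the extracted tagSet items to an array of JSON, one row per tag,
-- then select the tag by its key instead of by index
SELECT eventTime,
       awsRegion,
       json_extract_scalar(tag, '$.value') AS tag_value
FROM cloudtrail_logs_PP
CROSS JOIN UNNEST(
  CAST(json_extract(responseelements, '$.instancesSet.items[0].tagSet.items') AS ARRAY(JSON))
) AS t (tag)
WHERE (eventName = 'RunInstances' OR eventName = 'StartInstances')
  AND json_extract_scalar(tag, '$.key') = 'Name';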

jOOQ JSON formatting as array of objects

I have the following (simplified) jOOQ query:
val result = context.select(
    jsonObject(
        key("id").value(ITEM.ID),
        key("title").value(ITEM.NAAM),
        key("resources").value(
            jsonArrayAgg(ITEM_INHOUD.RESOURCE_ID).absentOnNull()
        )
    )
).from(ITEM).fetch()
Now the output that I want is:
[
  {
    "id": "0da04cc5-f70c-4fb3-b5c7-dc645d342631",
    "title": "Title1",
    "resources": [
      "8b0f6d5c-67fc-47ca-be77-d1735e7721ce",
      "ea0316db-1cfd-46d7-8260-5c1a4e65a0cd"
    ]
  },
  {
    "id": "0f7e67e6-5187-47e2-9f1d-dab08feba38b",
    "title": "Title2"
  }
]
result.formatJSON() gives the following output:
{
  "fields": [
    {
      "name": "json_object",
      "type": "JSON"
    }
  ],
  "records": [
    [
      {
        "id": "0da04cc5-f70c-4fb3-b5c7-dc645d342631",
        "title": "Title 1"
      }
    ]
  ]
}
Disabling the headers with result.formatJSON(JSONFormat.DEFAULT_FOR_RECORDS) will get me:
[
  [
    {
      "id": "0da04cc5-f70c-4fb3-b5c7-dc645d342631",
      "title": "Title1",
      "resources": [
        "8b0f6d5c-67fc-47ca-be77-d1735e7721ce",
        "ea0316db-1cfd-46d7-8260-5c1a4e65a0cd"
      ]
    }
  ],
  [
    {
      "id": "0f7e67e6-5187-47e2-9f1d-dab08feba38b",
      "title": "Title2"
    }
  ]
]
where I don't want the extra array.
Further customizing the JSON formatter with result.formatJSON(JSONFormat().header(false).recordFormat(JSONFormat.RecordFormat.OBJECT)) I get:
[
  {
    "json_object": {
      "id": "0da04cc5-f70c-4fb3-b5c7-dc645d342631",
      "title": "Title1",
      "resources": [
        "8b0f6d5c-67fc-47ca-be77-d1735e7721ce",
        "ea0316db-1cfd-46d7-8260-5c1a4e65a0cd"
      ]
    }
  },
  {
    "json_object": {
      "id": "0f7e67e6-5187-47e2-9f1d-dab08feba38b",
      "title": "Title2"
    }
  }
]
where I don't want the object wrapped in json_object.
Is there a way to get the output I want?
Doing it with Result.formatJSON()
This is clearly a flaw in the jOOQ 3.14.0 implementation of Result.formatJSON(). In the special case where there is only one column, and that column is of type JSON or JSONB, the column name may not really matter, and thus its contents should be flattened into the object describing the row. I've created a feature request for this: https://github.com/jOOQ/jOOQ/issues/10953. It will be available in jOOQ 3.15.0 and 3.14.4. You will be able to do this:
result.formatJSON(JSONFormat().header(false).wrapSingleColumnRecords(false));
The RecordFormat is irrelevant here. This works the same way for RecordFormat.ARRAY and RecordFormat.OBJECT.
Doing it directly with SQL
Of course, you can always work around this by moving all the logic into SQL. You probably simplified your query by omitting a JOIN and GROUP BY. I'm assuming this is equivalent to what you want:
val result: JSON = context.select(
    jsonArrayAgg(jsonObject(
        key("id").value(ITEM.ID),
        key("title").value(ITEM.NAAM),
        key("resources").value(
            select(jsonArrayAgg(ITEM_INHOUD.RESOURCE_ID).absentOnNull())
                .from(ITEM_INHOUD)
                .where(ITEM_INHOUD.ITEM_ID.eq(ITEM.ID))
        )
    ))
).from(ITEM).fetchSingle().value1()
Note that JSON_ARRAYAGG() aggregates empty sets into NULL, not into an empty []. If that's a problem, use COALESCE().
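In plain SQL terms, that is roughly the following sketch (the table and column names simply mirror the jOOQ identifiers above; the exact JSON functions vary by dialect):
-- COALESCE turns the NULL produced by an empty aggregate into an empty array
SELECT COALESCE(JSON_ARRAYAGG(resource_id), JSON_ARRAY()) AS resources
FROM item_inhoud
WHERE item_id = ?;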

Will there be a performance overhead when using an index with OBJECT_PAIRS (in the case of a covered query)? - Couchbase

Suppose I create an index on OBJECT_PAIRS(values).val.data.
Will my index store the “values” field as an array (with elements name for the ID and val for the data, as produced by OBJECT_PAIRS)?
If so, and if my N1QL query is a covered query (fetching only OBJECT_PAIRS(values).val.data in the SELECT clause), will there still be a performance overhead? I am under the impression that in this case, since the index would already contain the “values” field as an array, no OBJECT_PAIRS transformation would take place at query time, avoiding the overhead; only for a non-covered query would the actual document be accessed and the OBJECT_PAIRS transformation applied to the “values” field.
Couchbase document:
"values": {
"item_1": {
"data": [{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
},
"item_2": {
"data": [{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
}```
UPDATE:
Suppose we plan to create an index on OBJECT_PAIRS(values)[*].val.data and OBJECT_PAIRS(values)[*].name:
Index: CREATE INDEX idx01 ON ent_comms_tracking(ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END)
Query: SELECT ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END as values_array FROM bucket
Can you please paste your full create index statement?
Creating an index on OBJECT_PAIRS(values).val.data indexes nothing.
You can check this by creating a primary index and then running the query below:
SELECT OBJECT_PAIRS(`values`).val FROM mybucket
Output is:
[
{}
]
OBJECT_PAIRS(values) returns an array of objects containing the attribute name and value pairs of the values object:
SELECT OBJECT_PAIRS(`values`) FROM mybucket
[
  {
    "$1": [
      {
        "name": "item_1",
        "val": {
          "data": [
            {
              "name": "data_1",
              "value": "A"
            },
            {
              "name": "data_2",
              "value": "XYZ"
            }
          ]
        }
      },
      {
        "name": "item_2",
        "val": {
          "data": [
            {
              "name": "data_1",
              "value": "123"
            },
            {
              "name": "data_2",
              "value": "A23"
            }
          ]
        }
      }
    ]
  }
]
It's an array, so val can't be referenced on it directly; you have to iterate over the array to reach each element's val.
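For example, a minimal N1QL sketch against the same mybucket document:
/* iterate the OBJECT_PAIRS array to collect each pair's val.data */
SELECT ARRAY v.val.data FOR v IN OBJECT_PAIRS(`values`) END AS all_data
FROM mybucket;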

In Logic Apps, JSON array parsing throws an error for a single object but works fine for multiple objects

While parsing JSON in an Azure Logic App, my array can contain a single object or multiple values/objects (Box, as shown in the example below).
Both types of input are valid, but when only a single object arrives, parsing throws the error "Invalid type. Expected Array but got Object".
Input 1 (throws the error):
{
  "MyBoxCollection": {
    "Box": {
      "BoxName": "Box 1"
    }
  }
}
Input 2 (works fine):
{
  "MyBoxCollection": [
    {
      "Box": {
        "BoxName": "Box 1"
      },
      "Box": {
        "BoxName": "Box 2"
      }
    }
  ]
}
JSON Schema:
"MyBoxCollection": {
  "type": "object",
  "properties": {
    "box": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "BoxName": {
            "type": "string"
          },
          ......
          .....
          ..
}
Error details:
[
  {
    "message": "Invalid type. Expected Array but got Object.",
    "lineNumber": 0,
    "linePosition": 0,
    "path": "Order.MyBoxCollection.Box",
    "schemaId": "#/properties/Root/properties/MyBoxCollection/properties/Box",
    "errorType": "type",
    "childErrors": []
  }
]
I used to use the trick of injecting a couple of dummy rows in the resultset as suggested by the other posts, but I recently found a better way. Kudos to Thomas Prokov for providing the inspiration in his NETWORG blog post.
The JSON parse schema accepts multiple choices as type, so simply replace
"type": "array"
with
"type": ["array","object"]
and your parse step will happily parse either an array or a single value (or no value at all).
You may then need to identify which scenario you're in: 0, 1, or multiple records in the resultset. Below is how you can create a variable (ResultsetSize) which takes one of 3 values (rs_0, rs_1 or rs_n) for your switch:
"Initialize_ResultsetSize": {
"inputs": {
"variables": [
{
"name": "ResultsetSize",
"type": "string",
"value": "rs_n"
}
]
},
"runAfter": {
"<replace_with_name_of_previous_action>": [
"Succeeded"
]
},
"type": "InitializeVariable"
},
"Check_if_resultset_is_0_or_1_records": {
"actions": {
"Set_ResultsetSize_to_0": {
"inputs": {
"name": "ResultsetSize",
"value": "rs_0"
},
"runAfter": {},
"type": "SetVariable"
}
},
"else": {
"actions": {
"Set_ResultsetSize_to_1": {
"inputs": {
"name": "ResultsetSize",
"value": "rs_1"
},
"runAfter": {},
"type": "SetVariable"
}
}
},
"expression": {
"and": [
{
"equals": [
"#string(body('<replace_with_name_of_Parse_JSON_action>')?['<replace_with_name_of_root_element>']?['<replace_with_name_of_list_container_element>']?['<replace_with_name_of_item_element>']?['<replace_with_non_null_element_or_attribute>'])",
""
]
}
]
},
"runAfter": {
"Initialize_ResultsetSize": [
"Succeeded"
]
},
"type": "If"
},
"Process_resultset_depending_on_ResultsetSize": {
"cases": {
"Case_no_record": {
"actions": {
},
"case": "rs_0"
},
"Case_one_record_only": {
"actions": {
},
"case": "rs_1"
}
},
"default": {
"actions": {
}
},
"expression": "#variables('ResultsetSize')",
"runAfter": {
"Check_if_resultset_is_0_or_1_records": [
"Succeeded",
"Failed",
"Skipped",
"TimedOut"
]
},
"type": "Switch"
}
For this problem, I came across another Stack Overflow post that is similar. When there is only one "Box", it is rendered as a {key/value pair} rather than an [array] when converted to JSON format. I think this is by design, so maybe we can just add a dummy "Box" record at the source of your XML data, such as:
<Box>specific_test</Box>
and then do some operation to remove the "specific_test" entry in the later steps.
Another workaround for your reference:
If your JSON data has only one array, we can use it to decide which schema applies: check whether the data contains the "[" character. If it does, the return value is the index of the "[" character; if it doesn't, the return value is -1.
The expression shows as below:
indexOf('{"MyBoxCollection":{"Box":[aaa,bbb]}}', '[')
When the data doesn't contain "[", the expression returns -1.
Then we can add an "If" condition: if the result is greater than 0, do "Parse JSON" with one schema; if it is -1, do "Parse JSON" with the other schema.
Hope this helps with your problem.
We faced a similar issue. The only solution we found was manipulating the XML before conversion: we updated the XML nodes that need to be an array even when they contain a single element. We used an Azure Function to update the required XML attributes and then returned the XML for conversion in Logic Apps. Hope this helps someone.

Hive SQL query to get a JSON object from a JSON array

I have JSON inside a 'content' column in the following format:
{ "identifier": [
{
"type": {
"coding": [
{
"code": "MRN",
}
]
},
"value": "181"
},
{
"type": {
"coding": [
{
"code": "PID",
}
]
},
"value": "5d3669b0"
},
{
"type": {
"coding": [
{
"code": "IPN",
}
]
},
"value": "41806"
}
]}
I have to run a Hive query to get the "value" of the entry whose code equals "MRN".
I have written the following query, but it's not giving the value as expected:
select get_json_object(content, '$.identifier.value') as Mrn
from Doctor
where get_json_object(content, '$.identifier.type.coding.code') like '%MRN%'
I don't want to give a particular array position like:
select get_json_object(content, '$.identifier[0].value') as Mrn
from Doctor
where get_json_object(content, '$.identifier[0].type.coding.code') like '%MRN%'
because the JSON gets created randomly and the position is not always fixed.
Use [*] to avoid giving a position.
select get_json_object(content, '$.identifier[*].value') as Mrn
from Doctor
where get_json_object(content, '$.identifier[*].type.coding.code') like '%MRN%'
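Note that with [*], Mrn will contain every value in the array (["181","5d3669b0","41806"] for the sample above), not just the MRN one. If only the MRN value is wanted, one workaround is to explode the array into one row per element and filter each element on its own code. A hedged sketch, which assumes the serialized elements never contain the sequence "},{" internally (true for this document shape):
-- strip the outer [ ], split the array on the },{ boundaries,
-- explode into one element per row, then filter on that element's code
select get_json_object(ident, '$.value') as Mrn
from Doctor
lateral view explode(
  split(
    regexp_replace(
      regexp_replace(get_json_object(content, '$.identifier'), '^\\[|\\]$', ''),
      '\\},\\{',
      '}|{'),
    '\\|')
) t as ident
where get_json_object(ident, '$.type.coding[0].code') = 'MRN';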