PySpark: transform JSON into multiple dataframes

I have multiple JSON files with this structure (Association can have one or multiple objects, and Characteristic doesn't always have the same number of key/value pairs):
{
  "vl:VNETList": {
    "Template": {
      "ID": "SomeId",
      "Object": [
        {
          "ID": "my_first_id",
          "Context": {
            "ID": "Avngate"
          },
          "Name": "Model Description",
          "ClassID": "PID",
          "Association": [
            {
              "Object": {
                "ID": "test.svg",
                "Context": {
                  "ID": "Avngate"
                }
              },
              "#type": "is fulfilled by"
            },
            {
              "Object": {
                "ID": "Project Description",
                "Context": {
                  "ID": "Avngate"
                }
              },
              "#type": "is an element of"
            }
          ],
          "Characteristic": [
            {
              "Name": "InfoType",
              "Value": "image/svg+xml"
            },
            {
              "Name": "LOCK",
              "Value": false
            },
            {
              "Name": "EXFI",
              "Value": 10000
            }
          ]
        },
        {
          "ID": "my_second_id",
          "Context": {
            "ID": "Avngate2"
          },
          "Name": "Model Description2",
          "ClassID": "PID2",
          "Association": [
            {
              "Object": {
                "ID": "test2.svg",
                "Context": {
                  "ID": "Avngate"
                }
              },
              "#type": "is fulfilled by"
            }
          ],
          "Characteristic": [
            {
              "Name": "Dbtencoding",
              "Value": "unicode"
            }
          ]
        }
      ]
    }
  }
}
I would like to build two dataframes: the first with one row per Object and the Characteristic name/value pairs pivoted into columns, and the second with one row per Association (object ID, associated object ID, association type).
What's the best approach? If that's too complex, I could also save the characteristics as a separate table referencing the object ID, like the associations.

Read the JSON, then explode, groupBy and pivot for the first dataframe; a plain select with explode for the second.
from pyspark.sql import functions as f

df1 = spark.read.json('test.json', multiLine=True)

# One row per Object
df2 = df1.select(f.explode('vl:VNETList.Template.Object').alias('value')) \
    .select('value.*')

# First dataframe: pivot the Characteristic name/value pairs into columns
df_f1 = df2.withColumn('Characteristic', f.explode('Characteristic')) \
    .groupBy('ID', 'Name', 'ClassId') \
    .pivot('Characteristic.Name') \
    .agg(f.first('Characteristic.Value'))

# Second dataframe: one row per Association
df_f2 = df2.withColumn('Association', f.explode('Association')) \
    .select('ID', 'Association.Object.ID', 'Association.#Type') \
    .toDF('ID', 'AssociationId', 'AssociationType')

df_f1.show()
df_f2.show()
+------------+------------------+-------+-----------+-----+-------------+-----+
| ID| Name|ClassId|Dbtencoding| EXFI| InfoType| LOCK|
+------------+------------------+-------+-----------+-----+-------------+-----+
| my_first_id| Model Description| PID| null|10000|image/svg+xml|false|
|my_second_id|Model Description2| PID2| unicode| null| null| null|
+------------+------------------+-------+-----------+-----+-------------+-----+
+------------+-------------------+----------------+
| ID| AssociationId| AssociationType|
+------------+-------------------+----------------+
| my_first_id| test.svg| is fulfilled by|
| my_first_id|Project Description|is an element of|
|my_second_id| test2.svg| is fulfilled by|
+------------+-------------------+----------------+
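If you prefer the variant mentioned in the question, keeping the characteristics as a separate long-format table that references the object ID, here is a minimal sketch reusing df2 from above (the output column names are just illustrative):
# One row per (object ID, characteristic)
df_char = df2.withColumn('Characteristic', f.explode('Characteristic')) \
    .select('ID', 'Characteristic.Name', 'Characteristic.Value') \
    .toDF('ID', 'CharacteristicName', 'CharacteristicValue')
df_char.show()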

Related

Using JQ to filter JSON data at different levels

I have this JSON data in some-json-file, which contains the following:
{
"result": [
{
"id": "1234567812345678",
"name": "somewebsite.com",
"status": "active",
"type": "secondary",
"activated_on": "2021-12-12T15:44:40.444433Z",
"plan": {
"id": "77777777777777777777777777",
"name": "Enterprise Website",
"is_subscribed": true,
"legacy_id": "enterprise",
"externally_managed": true
}
}
],
"result_info": {
"page": 1,
"total_pages": 1
},
"success": true,
"messages": []
}
And I am trying to get this filtered output from it using jq
{
"name": "somewebsite.com",
"type": "secondary",
"plan": {
"name": "Enterprise Website",
"id": "77777777777777777777777777"
}
}
But I can't figure out how to do that.
I can filter the first layer of labels like this
cat some-json-file | jq '.result[] | {name,type,plan}'
Which gets me this output
{
"name": "somewebsite.com",
"type": "secondary",
"plan": {
"id": "77777777777777777777777777",
"name": "Enterprise Website",
"is_subscribed": true,
"legacy_id": "enterprise",
"externally_managed": true
}
}
That gets me close, but I can't further filter the child labels under .plan so that I see just the .name and .id.
Any ideas? Thanks!
You were almost there. Just set the new context and use the same technique again:
jq '.result[] | {name,type,plan: .plan | {name,id}}' some-json-file
{
"name": "somewebsite.com",
"type": "secondary",
"plan": {
"name": "Enterprise Website",
"id": "77777777777777777777777777"
}
}
Note: You don't need to cat the input; jq accepts the filename as a parameter.
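A small variation (not part of the original answer): since .result is an array, you can wrap the whole filter in brackets to get a single JSON array even when there are multiple results:
jq '[.result[] | {name, type, plan: .plan | {name, id}}]' some-json-file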

Need to extract a specific string with jq

I have a JSON file (see below), and with jq I need to extract the resourceName value for value = mail#mail1.com.
So in my case, the result should be name_1.
Any idea how to do that?
This attempt does not work:
jq '.connections[] | select(.emailAddresses.value | test("mail#mail1.com"; "i")) | .resourceName' file.json
{
"connections": [
{
"resourceName": "name_1",
"etag": "123456789",
"emailAddresses": [
{
"metadata": {
"primary": true,
"source": {
"type": "CONTACT",
"id": "123456"
}
},
"value": "mail#mail1.com",
}
]
},
{
"resourceName": "name_2",
"etag": "987654321",
"emailAddresses": [
{
"metadata": {
"primary": true,
"source": {
"type": "CONTACT",
"id": "654321"
},
"sourcePrimary": true
},
"value": "mail#mail2.com"
}
]
}
],
"totalPeople": 187,
"totalItems": 187
}
One solution is to store the parent object while selecting on the child array:
jq '.connections[] | . as $parent | .emailAddresses // empty | .[] | select(.value == "mail#mail1.com") | $parent.resourceName' file.json
emailAddresses is an array. Use any if finding one element that matches will suffice.
.connections[] | select(any(.emailAddresses[];.value == "mail#mail1.com")).resourceName
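For completeness, invoking that second filter against file.json with -r (raw output, so the result isn't quoted) would look roughly like this, assuming the stray trailing comma after "value": "mail#mail1.com" in the sample is removed so the file parses:
jq -r '.connections[] | select(any(.emailAddresses[]; .value == "mail#mail1.com")).resourceName' file.json
name_1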

Extract keys and data and write an array

Given the following JSON source:
{
"pages":{
"yomama/first key": {
"data": {
"fieldset": "lesson-video-overview",
"title": "5th Grade Math - Interpreting Fractions",
},
"order": 4
},
"yomama/second key": {
"data": {
"fieldset": "lesson-video-clip-single",
"title": "Post-Lesson Debrief Part 5",
},
"order": 14
},
"yopapa/Third key": {
"data": {
"fieldset": "lesson-video-clip-single",
"title": "Lesson Part 2B",
},
"order": 6
}
}
}
How could I produce an array output like this? The main challenge for me is extracting the key, e.g. "yomama/first key". Ideally, I could also filter so I only get the entries whose keys start with "yomama" (but not "yopapa").
[
{
"url" : "yomama/first key",
"data": {
"fieldset": "lesson-video-overview",
"title": "5th Grade Math - Interpreting Fractions",
},
"order": 4
},
{
"url" : "yomama/second key",
"data": {
"fieldset": "lesson-video-clip-single",
"title": "Post-Lesson Debrief Part 5",
},
"order": 14
},
{
"url" : "yopapa/Third key",
"data": {
"fieldset": "lesson-video-clip-single",
"title": "Lesson Part 2B",
},
"order": 6
}
]
Assuming the input is in so.json and corrected to well-formed JSON, you may use:
jq '[.pages | to_entries[] | {"url": .key, "data": .value.data, "order": .value.order}]' < so.json
Here's a solution that does not require being explicit about including all the other keys:
.pages
| [ to_entries[]
| select(.key | startswith("yomama"))
| {url: .key} + .value ]
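With the sample corrected as noted above, that filter should produce roughly:
[
  {
    "url": "yomama/first key",
    "data": {
      "fieldset": "lesson-video-overview",
      "title": "5th Grade Math - Interpreting Fractions"
    },
    "order": 4
  },
  {
    "url": "yomama/second key",
    "data": {
      "fieldset": "lesson-video-clip-single",
      "title": "Post-Lesson Debrief Part 5"
    },
    "order": 14
  }
]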

Getting unique values from nested Array using jq

I'm trying to get the unique values stored in the items array for each group, but somehow they always end up mixed together...
My JSON looks like this:
{
"start": 1534425916,
"stop": 1535030716,
"groups": [
{
"group": "transmission",
"data": {
"events": 665762,
},
"items": [
{
"item": "manualni",
"data": {
"events": 389158,
}
},
{
"item": "automaticka",
"data": {
"events": 276604,
}
}
]
},
{
"group": "vat",
"data": {
"events": 671924,
},
"items": [
{
"item": "ne",
"data": {
"events": 346221,
}
},
{
"item": "ano",
"data": {
"events": 325703,
}
}
]
}
]
}
Desired result is the following:
{
"id": "transmission",
"value": [
"manualni",
"automaticka",
]
}
{
"id": "vat",
"value": [
"ne",
"ano"
]
}
I tried this filter on the command line:
| jq '{id: .groups[].group, value: [.groups[].items[].item]}'
which produces the mixed-up result mentioned above:
{
"id": "transmission",
"value": [
"manualni",
"automaticka",
"ne",
"ano"
]
}
{
"id": "vat",
"value": [
"manualni",
"automaticka",
"ne",
"ano"
]
}
Any idea how to receive the uniquified values here? Thanks in advance!
This gets the desired result. I think the manual entry under .[] explains why it works.
jq '.groups[] | {"id": .group, "value": [.items[].item]}'
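If the same item could appear more than once within a group, you could additionally apply jq's unique builtin to the collected array (note that unique also sorts the values), e.g.:
jq '.groups[] | {id: .group, value: ([.items[].item] | unique)}'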

Query parent data (multi-level) based on a child value, on a json file, using jq

I have a ksh script that retrieves (using curl) a JSON file similar to the one below:
{
"Type1": {
"dev": {
"server": [
{ "group": "APP1", "name": "DAPP1002", "ip": "10.1.1.1" },
{ "group": "APP2", "name": "DAPP2001", "ip": "10.1.1.2" }
]
},
"qa": {
"server": [
{ "group": "APP1", "name": "QAPP1002", "ip": "10.1.2.1" },
{ "group": "APP2", "name": "QAPP2001", "ip": "10.1.2.2" }
]
},
"prod": {
"proxy": "type1.prod.proxy.mydomain.com",
"server": [
{ "group": "APP1", "name": "PAPP1001", "ip": "10.1.3.1" },
{ "group": "APP1", "name": "PAPP1002", "ip": "10.1.3.2" },
{ "group": "APP2", "name": "PAPP2001", "ip": "10.1.3.3" }
]
}
},
"Type2": {
"dev": {
"server": [
{ "group": "APP8", "name": "DAPP8002", "ip": "10.2.1.1" },
{ "group": "APP9", "name": "DAPP9001", "ip": "10.2.1.2" }
]
},
"qa": {
"server": [
{ "group": "APP8", "name": "QAPP8002", "ip": "10.2.2.1" },
{ "group": "APP9", "name": "QAPP9001", "ip": "10.2.2.2" }
]
},
"prod": {
"proxy": "type2.prod.proxy.mydomain.com",
"server": [
{ "group": "APP8", "name": "PAPP8001", "ip": "10.2.3.1" },
{ "group": "APP9", "name": "PAPP9001", "ip": "10.2.3.2" },
{ "group": "APP9", "name": "PAPP9002", "ip": "10.2.3.3" }
]
}
}
}
... based on a server name (field "name") I would have to collect the following info, to pass to a function:
"Type", "name", "ip", "proxy"
(Note that the "proxy" info is optional)
I am new to JSON, and I am trying to get this filtered with jq, but so far I am out of luck.
What I have accomplished so far is the following jq query, when searching for "PAPP9001":
jq '.[] | .[] | select(.server[].name=="PAPP9001") | .proxy as $proxy | .server[] | {proxy: $proxy, name: .name, ip: .ip} | select(.name=="PAPP9001")' curlreturn.json
which returns me:
{
"proxy": "type2.prod.proxy.mydomain.com",
"name": "PAPP9001",
"ip": "10.2.3.2"
}
But I have two issues: I could not get the "Type" info at the top level, and considering the number of pipes and the two selects, I doubt that this is the most efficient way.
One way to retrieve the key names programmatically is using to_entries. For example, given your input, this jq filter:
to_entries[]
| .key as $type
| .value[]
| .proxy as $proxy
| .server[]
| select(.name == "PAPP9001")
| { Type: $type, name, ip, proxy: $proxy }
yields:
{
"Type": "Type2",
"name": "PAPP9001",
"ip": "10.2.3.2",
"proxy": "type2.prod.proxy.mydomain.com"
}
Variations
If, for example, you wanted these four fields as a CSV row, then you could replace the last line of the filter above with:
| [$type, .name, .ip, $proxy] | @csv
See the jq manual for how to use string interpolation.
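For instance (a sketch, not from the original answer), string interpolation could build a plain text line instead of a CSV row; run jq with -r to print it without the surrounding quotes:
| "\($type) \(.name) \(.ip) \($proxy)"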