Parse nested JSON that has the same attribute at several levels with jq streaming mode - json

I want to parse the data from the nested JSON file below, but it has many "keys" attributes at different levels, which makes the data hard to parse.
{
"jobname": {
"keys": {
"jobid":"E000295",
"car":"BMW"
},
"property":{
"doctype":"File",
"areadesc":[
{
"areaid":"qaz",
"weather":"hot",
},
{
"areaid":"wsx",
"weather":"code",
},
{
"areaid":"edc",
"weather":"hot",
},
{
"areaid":"rfv",
"weather":"hot",
}
]
},
"toolJobs":[
{
"keys":{
"toolid":"123"
},
"reports":[
{
"keys":{
"oiltype":"a",
"oilcountry":"us"
},
"property":{"reportid":"001"},
"datas":[
{
"keys":{"areaid":"qaz"},
"data":[
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 3
},
]
},
{
"keys":{"areaid":"wsx"},
"data":[
{
"time": "2021-01-03",
"value": 5
},
{
"time": "2021-01-04",
"value": 7
},
]
},
]
},
{
"keys":{
"oiltype":"b",
"oilcountry":"china"
},
"property":{"reportid":"002"},
"datas":[
{
"keys":{"areaid":"edc"},
"data":[
{
"time": "2021-01-05",
"value": 2
},
{
"time": "2021-01-06",
"value": 4
},
]
},
{
"keys":{"areaid":"rfv"},
"data":[
{
"time": "2021-01-07",
"value": 6
},
{
"time": "2021-01-08",
"value": 8
},
]
},
]
}
]
}
]
}
}
So far I can use the code below to get a basic result, but some columns are missing, such as oiltype, oilcountry, reportid, and areaid:
cat tmp1.json | jq -cn --stream '
[fromstream(
1|truncate_stream(inputs)
| (.[0][:2] | index("keys")) as $ix
| if $ix then .[0] |= .[1+$ix:]
else (.[0] | index("toolJobs")) as $iy | (.[0][$iy:$iy+3] | index("keys")) as $iz
| if $iz then .[0] |= .[1+$iy+$iz:]
else (.[0] | index("data")) as $ik
| if $ik then .[0] |= .[$ik:]
else empty
end
end
end
)] | .[0] as $header | .[1] as $tool | [.[2:][] | ($header+ $tool+.)] | .'
The result is
[
{"jobid":"E000295","car":"BMW","toolid":"123","data":[{"time":"2021-01-01","value":1},{"time":"2021-01-02","value":3}]},
{"jobid":"E000295","car":"BMW","toolid":"123","data":[{"time":"2021-01-03","value":5},{"time":"2021-01-04","value":7}]},
{"jobid":"E000295","car":"BMW","toolid":"123","data":[{"time":"2021-01-05","value":2},{"time":"2021-01-06","value":4}]},
{"jobid":"E000295","car":"BMW","toolid":"123","data":[{"time":"2021-01-07","value":6},{"time":"2021-01-08","value":8}]}]
I also tried the code below:
cat tmp1.json | jq -cn --stream '
[fromstream(
1|truncate_stream(inputs)
| (.[0][:2] | index("keys")) as $ix
| if $ix then .[0] |= .[1+$ix:]
else (.[0] | index("toolJobs")) as $iy | (.[0][$iy:$iy+3] | index("keys")) as $iz
| if $iz then .[0] |= .[1+$iy+$iz:]
else (.[0] | index("data")) as $ik
| if $ik then .[0] |= .[$ik:]
else (.[0] | index("reports")) as $iw | (.[0][$iw:$iw+3] | index("property")) as $ii
| if $ii then (.[0] |= .[$iw+$ii:])
else (.[0] | index("keys")) as $ij
| if $ij then (.[0] |= .[$ij:])
else empty
end
end
end
end
end
)] | .[0] as $header | .[1] as $prjob | [.[2:][] | ($header + $prjob + .)] | .'
but the result is strange
[
{"jobid":"E000295","car":"BMW","property":{"reportid":"001"},"toolid":"123","keys":{"oiltype":"a","oilcountry":"us","areaid":"qaz"},"data":[{"time":"2021-01-01","value":1},{"time":"2021-01-02","value":3}]},
{"jobid":"E000295","car":"BMW","property":{"doctype":"File","areadesc":[{"areaid":"qaz","weather":"hot"},{"areaid":"wsx","weather":"code"},{"areaid":"edc","weather":"hot"},{"areaid":"rfv","weather":"hot"}]},"toolid":"123","keys":{"areaid":"wsx"},"data":[{"time":"2021-01-03","value":5},{"time":"2021-01-04","value":7}]},
{"jobid":"E000295","car":"BMW","property":{"reportid":"002"},"toolid":"123","keys":{"oiltype":"b","oilcountry":"china","areaid":"edc"},"data":[{"time":"2021-01-05","value":2},{"time":"2021-01-06","value":4}]},
{"jobid":"E000295","car":"BMW","property":{"doctype":"File","areadesc":[{"areaid":"qaz","weather":"hot"},{"areaid":"wsx","weather":"code"},{"areaid":"edc","weather":"hot"},{"areaid":"rfv","weather":"hot"}]},"toolid":"123","keys":{"areaid":"rfv"},"data":[{"time":"2021-01-07","value":6},{"time":"2021-01-08","value":8}]}
]
Below is my expected result
[
{
"jobid":"E000295",
"car":"BMW",
"toolid":"123",
"oiltype":"a",
"oilcountry":"us",
"reportid":"001",
"areaid":"qaz",
"data":[
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 3
},
]
},
{
"jobid":"E000295",
"car":"BMW",
"toolid":"123",
"oiltype":"a",
"oilcountry":"us",
"reportid":"001",
"areaid":"wsx",
"data":[
{
"time": "2021-01-03",
"value": 5
},
{
"time": "2021-01-04",
"value": 7
},
]
},
{
"jobid":"E000295",
"car":"BMW",
"toolid":"123",
"oiltype":"b",
"oilcountry":"china",
"reportid":"002",
"areaid":"edc",
"data":[
{
"time": "2021-01-05",
"value": 2
},
{
"time": "2021-01-06",
"value": 4
},
]
},
{
"jobid":"E000295",
"car":"BMW",
"toolid":"123",
"oiltype":"b",
"oilcountry":"china",
"reportid":"002",
"areaid":"rfv",
"data":[
{
"time": "2021-01-07",
"value": 6
},
{
"time": "2021-01-08",
"value": 8
},
]
}
]
Does anyone have any idea?

Assuming the input has been corrected (as posted, it contains trailing commas, which are not valid JSON), the following "regular" jq program produces the desired result:
[
.jobname
| .keys as $one
| .toolJobs[]
| .keys as $two
| .reports[]
| (.keys + .property) as $three
| .datas[]
| (.keys + {data}) as $four
| $one + $two + $three + $four
]
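(Here {data} is jq shorthand for {data: .data}, so $four combines each datas entry's keys with its data array.)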
If your input is too large, you could reduce the memory requirements by creating a jq-to-jq pipeline, with the first invocation using the above program (or a --stream version of it) but with the outer brackets removed.
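For illustration, here is a minimal sketch of such a pipeline, assuming the corrected input is in tmp1.json and using the non-streaming program for the first stage; the first invocation emits one row per line, and the second collects the rows back into an array:
jq -c '
.jobname
| .keys as $one
| .toolJobs[]
| .keys as $two
| .reports[]
| (.keys + .property) as $three
| .datas[]
| (.keys + {data}) as $four
| $one + $two + $three + $four
' tmp1.json |
jq -n '[inputs]'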

Related

How to transform nested JSON to csv using jq

I have tried to transform JSON in the following format to CSV using jq on the Linux command line, but with no success. Any help or guidance would be appreciated.
{
"dir/file1.txt": [
{
"Setting": {
"SettingA": "",
"SettingB": null
},
"Rule": "Rulechecker.Rule15",
"Description": "",
"Line": 11,
"Link": "www.sample.com",
"Message": "Some message",
"Severity": "error",
"Span": [
1,
3
],
"Match": "[id"
},
{
"Setting": {
"SettingA": "",
"SettingB": null
},
"Check": "Rulechecker.Rule16",
"Description": "",
"Line": 27,
"Link": "www.sample.com",
"Message": "Fix the rule",
"Severity": "error",
"Span": [
1,
3
],
"Match": "[id"
}
],
"dir/file2.txt": [
{
"Setting": {
"SettingA": "",
"SettingB": null
},
"Rule": "Rulechecker.Rule17",
"Description": "",
"Line": 51,
"Link": "www.example.com",
"Message": "Fix anoher 'rule'?",
"Severity": "error",
"Span": [
1,
18
],
"Match": "[source,terminal]\n----\n"
}
]
}
Ultimately, I want to present a matrix with dir/file1.txt, dir/file2.txt as rows on the left of the matrix, and all the keys to be presented as column headings, with the corresponding values.
| Filename | SettingA | SettingB | Rule | More columns... |
| -------- | -------------- | -------------- | -------------- | -------------- |
| dir/file1.txt | | null | Rulechecker.Rule15 | |
| dir/file1.txt | | null | Rulechecker.Rule16 | |
| dir/file2.txt | | null | Rulechecker.Rule17 | |
Iterate over the top-level key-value pairs obtained by to_entries to get access to the key names, then once again over its content array in .value to get the array items. Also note that newlines as present in the sample's last .Match value cannot be used as is in a line-oriented format such as CSV. Here, I chose to replace them with the literal string \n using gsub.
jq -r '
to_entries[] | . as {$key} | .value[] | [$key,
(.Setting | .SettingA, .SettingB),
.Rule // .Check, .Description, .Line, .Link,
.Message, .Severity, .Span[], .Match
| strings |= gsub("\n"; "\\n")
] | @csv
'
"dir/file1.txt","",,"Rulechecker.Rule15","",11,"www.sample.com","Some message","error",1,3,"[id"
"dir/file1.txt","",,"Rulechecker.Rule16","",27,"www.sample.com","Fix the rule","error",1,3,"[id"
"dir/file2.txt","",,"Rulechecker.Rule17","",51,"www.example.com","Fix anoher 'rule'?","error",1,18,"[source,terminal]\n----\n"
If you just want to dump all the values in the order they appear, you can simplify this by using .. | scalars to traverse the levels of the document:
jq -r '
to_entries[] | . as {$key} | .value[] | [$key,
(.. | scalars) | strings |= gsub("\n"; "\\n")
] | @csv
'
"dir/file1.txt","",,"Rulechecker.Rule15","",11,"www.sample.com","Some message","error",1,3,"[id"
"dir/file1.txt","",,"Rulechecker.Rule16","",27,"www.sample.com","Fix the rule","error",1,3,"[id"
"dir/file2.txt","",,"Rulechecker.Rule17","",51,"www.example.com","Fix anoher 'rule'?","error",1,18,"[source,terminal]\n----\n"
As for the column headings, for the first case I'd add them manually, as you spell out each value path anyway. For the latter case it will be a little more complicated, as not all columns have immediate names (what should the items of array Span be called?), and some seem to change (in the second record, column Rule is called Check). You could, however, stick to the names of the first record, and take the deepest field name either as is or with the array indices added. Something along these lines would do:
jq -r '
to_entries[0].value[0] | ["Filename", (
path(..|scalars) | .[.[[map(strings)|last]]|last:] | join(".")
)] | @csv
'
"Filename","SettingA","SettingB","Rule","Description","Line","Link","Message","Severity","Span.0","Span.1","Match"

Convert a string in a PySpark dataframe to a table, obtaining only what is necessary from the string

{
"schema": {
"type": "struct",
"fields": [
{
"type": "int32",
"optional": true,
"field": "c1"
},
{
"type": "string",
"optional": true,
"field": "c2"
},
{
"type": "int64",
"optional": false,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "create_ts"
},
{
"type": "int64",
"optional": false,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "update_ts"
}
],
"optional": false,
"name": "foobar"
},
"payload": {
"c1": 67,
"c2": "foo",
"create_ts": 1663920002000,
"update_ts": 1663920002000
}
}
I have my JSON string in this format. I don't want the whole data in the table; I want the table in this format:
| c1 | c2 | create_ts | update_ts |
+------+------+------------------+---------------------+
| 1    | foo  | 2022-09-21 10:47:54 | 2022-09-21 10:47:54 |
| 28 | foo | 2022-09-21 13:16:45 | 2022-09-21 13:16:45 |
| 29 | foo | 2022-09-21 14:19:10 | 2022-09-21 14:19:10 |
| 30 | foo | 2022-09-21 14:19:20 | 2022-09-21 14:19:20 |
| 31 | foo | 2022-09-21 14:29:19 | 2022-09-21 14:29:19 |
Skip the other (nested) attributes by selecting only the part you want to see in the resulting output:
(
spark
.read
.option("multiline","true")
.json("/path/json-path")
.select("payload.*")
.show()
)
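Here, payload.* expands the payload struct into the top-level columns c1, c2, create_ts and update_ts. Note that the Kafka Connect timestamp fields arrive as epoch milliseconds, so an additional conversion step would be needed to render them as the formatted timestamps shown in the desired table.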

How to use jq when the desired key is inside nested JSON

Here is id.json:
{
"name": "peter",
"path": "desktop/name",
"description": "male",
"env1": {
"school": "AAA",
"height": "150",
"weight": "80"
},
"env2": {
"school": "BBB",
"height": "160",
"weight": "70"
}
}
There can be more (env3, env4, etc.), created automatically.
I am trying to get env1 by using height and weight as keys,
so the output can look like:
env1:height:150
env1:weight:80
env2:height:160
env2:weight:70
env3:height:xxx
.
.
.
The shell command I tried, jq .env1.height... id.json, can only get the output by naming env1, env2 as keys, but it cannot handle env3, env4. I also used jq's to_entries[] to convert the JSON into key/value pairs, but the first few rows made it impossible to get .value.weight as output. Any ideas, please?
Update:
I edited the JSON to remove these three lines:
"name": "peter",
"path": "desktop/name",
"description": "male",
Then I ran the command below:
jq 'to_entries[] | select(.value.height!=null) | [.key, .value.height, .value.weight]' id2.json
and got the result below:
[
"dev",
"1",
"1"
]
[
"sit",
"1",
"1"
]
This is almost what I need, but any idea how to remove the outer JSON array level, please?
Using your data as initially presented, the following jq program:
keys_unsorted[] as $k
| select($k|startswith("env"))
| .[$k] | to_entries[]
| select(.key|IN("height","weight"))
| [$k, .key, .value]
| join(":")
produces
env1:height:150
env1:weight:80
env2:height:160
env2:weight:70
An answer to the supplementary question
According to one interpretation of the supplementary question, a solution would be:
keys_unsorted[] as $k
| .[$k]
| objects
| select(.height and .weight)
| to_entries[]
| select(.key|IN("height","weight"))
| [$k, .key, .value]
| join(":")
Equivalently, but without the redundancy:
["height","weight"] as $hw
| keys_unsorted[] as $k
| .[$k]
| objects
| . as $object
| select(all($hw[]; $object[.]))
| $hw[]
| [$k, ., $object[.]]
| join(":")

How to not let jq interpret the newline character when exporting to CSV

I want to convert the following JSON content stored in a file tmp.json
{
"results": [
[
{
"field": "field1",
"value": "value1-1"
},
{
"field": "field2",
"value": "value1-2\n"
}
],
[
{
"field": "field1",
"value": "value2-1"
},
{
"field": "field2",
"value": "value2-2\n"
}
]
]
}
into a CSV output
"field1","field2"
"value1-1","value1-2\n"
"value2-1","value2-2\n"
When I use this jq command, however,
cat tmp.json | jq -r '.results | (first | map(.field)), (.[] | map(.value)) | @csv'
I get this result:
"field1","field2"
"value1-1","value1-2
"
"value2-1","value2-2
"
How should the jq command be written to get the desired CSV result?
For a jq-only solution, you can use gsub("\n"; "\\n"). I'd go with something like this:
.results
| (.[0] | map(.field)),
(.[] | map( .value | gsub("\n"; "\\n")))
| @csv
Using your JSON and invoking this with the -r command line option yields:
"field1","field2"
"value1-1","value1-2\n"
"value2-1","value2-2\n"
If newlines are the only thing you need to handle, maybe you can do a string replacement.
cat tmp.json | jq -r '.results | (first | map(.field)), (.[] | map(.value) | map(gsub("\\n"; "\\n"))) | @csv'
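Note the subtlety in gsub("\\n"; "\\n"): the first argument is interpreted as a regex, so the two-character string \n matches an actual newline there, whereas in the replacement string it produces the literal characters \n.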

How to tabulate nested JSON file with jq

I have the following JSON file that I would like to parse with the jq tool that someone suggested to me, but I'm new to it. There are 3 parent nodes
with the same children names. The parent nodes are MNRs, GNRs and MSNRs, and each of them has children named N1, N2, NR_i, NR_f.
{
"Main": {
"Document": "Doc.1",
"Cini": "DDFR",
"List": {
"SubList": {
"CdTa": "ABC",
"NN": "XYZ",
"ND": {
"RiS": {
"RiN": {
"NSE14": {
"MNRs": {
"MRD": [
{
"NR": {
"N1": "393",
"N2": "720",
"SNR": {
"NR_i": "203",
"NR_f": "49994"
}
}
},
{
"NR": {
"N1": "687",
"N2": "345",
"SNR": {
"NR_i": "55005",
"NR_f": "1229996"
}
}
}
]
},
"GNRs": {
"RD": {
"NR": {
"N1": "649",
"N2": "111",
"SNR": {
"NR_i": "55400",
"NR_f": "877"
}
}
}
},
"MSNRs": {
"NR": [
{
"N1": "748",
"N2": "5624",
"SNR": {
"NR_i": "8746",
"NR_f": "7773"
}
},
{
"N1": "124",
"N2": "54",
"SNR": {
"NR_i": "8847",
"NR_f": "5526"
}
}
]
}
},
"NSE12": {
"MBB": "990",
"MRB": "123"
},
"MGE13": {
"TBB": "849",
"TRB": "113"
}
}
}
}
}
}
}
}
With this code I get the following
.Main.List.SubList.ND.RiS.RiN.NSE14.MNRs.MRD
[
{
"NR": {
"N1": "393",
"N2": "720",
"SNR": {
"NR_i": "203",
"NR_f": "49994"
}
}
},
{
"NR": {
"N1": "687",
"N2": "345",
"SNR": {
"NR_i": "55005",
"NR_f": "1229996"
}
}
}
]
And with these commands I get columns of individual values for each child, and nulls for the others.
.. | .N1?
.. | .N2?
.. | .NR_i?
.. | .NR_f?
I'm far from my desired output, since I'd like to extract the children for each parent and tabulate them in the form below.
+------+------+-------+---------+-----+-----+-------+------+-----+------+------+------+
| MNRs | GNRs | MSNRs |
+------+------+-------+---------+-----+-----+-------+------+-----+------+------+------+
| N1 | N2 | NR_i | NR_f | N1 | N2 | NR_i | NR_f | N1 | N2 | NR_i | NR_f |
+------+------+-------+---------+-----+-----+-------+------+-----+------+------+------+
| 393 | 720 | 203 | 49994 | 649 | 111 | 55400 | 877 | 748 | 5624 | 8746 | 7773 |
+------+------+-------+---------+-----+-----+-------+------+-----+------+------+------+
| 687 | 345 | 55005 | 1229996 | | | | | 124 | 54 | 8847 | 5526 |
+------+------+-------+---------+-----+-----+-------+------+-----+------+------+------+
Can someone help me with this? Thanks in advance.
Since the nature of the input JSON has only been given by example, let's begin by defining a filter for linearizing .NR:
# Produce a stream of arrays
def linearize:
  if type == "array" then .[] | linearize
  else [ .N1, .N2, .SNR.NR_i, .SNR.NR_f ]
  end;
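(For instance, applied to the first .NR object under MRD, linearize yields ["393","720","203","49994"].)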
The relevant data can now be extracted while preserving the top-level groups as follows:
.Main.List.SubList.ND.RiS.RiN.NSE14
| [to_entries[]
| [.key]
+ [.value | .. | objects | select(has("NR")) | .NR | [ linearize ]] ]
Because the input JSON is not uniform, it will help to ensure uniformity by augmenting the above pipeline with the following mapping:
| map(if length > 2 then [.[0], [.[1:][][]]] else . end)
This produces a single JSON array structured like this:
[["MNRs",[["393","720","203","49994"]],[["687","345","55005","1229996"]]],
["GNRs", ...
To obtain the first data row of the table from this intermediate result, it will be worthwhile defining a function that will provide the necessary padding:
def row($i; $padding):
  . as $in
  | [range(0;$padding) | null] as $nulls
  | reduce range(0; length) as $ix
      ([]; . + ($in[$ix][1][$i] // $nulls));
Now the first data row can be obtained by row(0;4), the second by row(1;4), etc.
The total number of data rows would be given by filtering the intermediate data structure through map(.[1] | length) | max; thus, the data rows can be obtained by tacking the following onto the previous pipeline:
| (map(.[1] | length) | max) as $rows
| range(0; $rows) as $r
| row($r; 4)
| @tsv
Using the -r command-line option and the given sample, the output would be:
393 720 203 49994 649 111 55400 877 748 5624 8746 7773
687 345 55005 1229996 124 54 8847 5526
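For convenience, here is a sketch of the above fragments assembled into a single program, assuming it is saved as, say, tab.jq and invoked with jq -r -f tab.jq input.json:
def linearize:
  if type == "array" then .[] | linearize
  else [ .N1, .N2, .SNR.NR_i, .SNR.NR_f ]
  end;

def row($i; $padding):
  . as $in
  | [range(0;$padding) | null] as $nulls
  | reduce range(0; length) as $ix
      ([]; . + ($in[$ix][1][$i] // $nulls));

.Main.List.SubList.ND.RiS.RiN.NSE14
| [to_entries[]
| [.key]
+ [.value | .. | objects | select(has("NR")) | .NR | [ linearize ]] ]
| map(if length > 2 then [.[0], [.[1:][][]]] else . end)
| (map(.[1] | length) | max) as $rows
| range(0; $rows) as $r
| row($r; 4)
| @tsv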
Adding the headers is left as an exercise :-)