jq: parse JSON with the --stream flag into a different JSON file
I have a JSON file called data.json, shown below. I want to parse it with jq in streaming mode (that is, without loading the whole file into memory), because the real data is 20 GB.
As far as I can tell, streaming mode is enabled with the --stream flag, which makes jq parse the file incrementally instead of reading it all at once.
{
"id": {
"bioguide": "E000295",
"thomas": "02283",
"govtrack": 412667,
"opensecrets": "N00035483",
"lis": "S376"
},
"bio": {
"gender": "F",
"birthday": "1970-07-01"
},
"tooldatareports": [
{
"name": "A",
"tooldata": [
{
"toolid": 12345,
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"toolid": 12346,
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
}
]
}
The result I am hoping for is shown below:
a list of two objects, each holding a "data" array whose entries have two keys (time and value).
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
For this problem, I used the command below. It produces a result, but not quite the one I want.
cat data.json | jq --stream 'select(.[0][0]=="tooldatareports" and .[0][2]=="tooldata" and .[1]!=null) | .'
The result is not a list of objects.
Instead, each time and each value is emitted separately, in its own array.
Does anyone have any idea how to get the output above?
Here's a solution that does not use truncate_stream:
jq -n --stream '
[fromstream(
inputs
| (.[0] | index("data")) as $ix
| select($ix)
| .[0] |= .[$ix:] )]
' input.json
The following produces the required output:
jq -n --stream '
[{data: fromstream(5|truncate_stream(inputs))}]
' input.json
Needless to say, there are other variations ...
Here's a step-by-step explanation of peak's answers.
First, let's convert the JSON to a stream.
https://jqplay.org/s/VEunTmDSkf
[["id","bioguide"],"E000295"]
[["id","thomas"],"02283"]
[["id","govtrack"],412667]
[["id","opensecrets"],"N00035483"]
[["id","lis"],"S376"]
[["id","lis"]]
[["bio","gender"],"F"]
[["bio","birthday"],"1970-07-01"]
[["bio","birthday"]]
[["tooldatareports",0,"name"],"A"]
[["tooldatareports",0,"tooldata",0,"toolid"],12345]
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"toolid"],12346]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
[["tooldatareports",0,"tooldata",1]]
[["tooldatareports",0,"tooldata"]]
[["tooldatareports",0]]
[["tooldatareports"]]
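A listing like the one above can be reproduced locally with the --stream flag, plus -c for one event per line. This is a minimal sketch on a cut-down document; the sample JSON and the variable names are made up for brevity and are not the full data.json.

```shell
# Cut-down stand-in for data.json (hypothetical sample, not the real file).
sample='{"bio":{"gender":"F"},"data":[{"time":"2021-01-01","value":1}]}'

# --stream emits a [path, value] event for every leaf and a [path] closing
# event whenever an object or array ends; -c prints each event on one line.
events=$(printf '%s' "$sample" | jq -c --stream '.')
printf '%s\n' "$events"
```

The two-element events carry the leaf values, and the one-element events only mark where a container closes.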
Now apply .[0] to extract the path portion of each event.
https://jqplay.org/s/XdPrp8RuEj
["id","bioguide"]
["id","thomas"]
["id","govtrack"]
["id","opensecrets"]
["id","lis"]
["id","lis"]
["bio","gender"]
["bio","birthday"]
["bio","birthday"]
["tooldatareports",0,"name"]
["tooldatareports",0,"tooldata",0,"toolid"]
["tooldatareports",0,"tooldata",0,"data",0,"time"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"time"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"time"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2]
["tooldatareports",0,"tooldata",0,"data"]
["tooldatareports",0,"tooldata",1,"toolid"]
["tooldatareports",0,"tooldata",1,"data",0,"time"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"time"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"time"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2]
["tooldatareports",0,"tooldata",1,"data"]
["tooldatareports",0,"tooldata",1]
["tooldatareports",0,"tooldata"]
["tooldatareports",0]
["tooldatareports"]
Let me first quickly explain index/1.
Take the event [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]. Applied to its path, ["tooldatareports",0,"tooldata",0,"data",0,"time"], index("data") returns 4, since that is the index of the first occurrence of "data".
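To see index/1 on its own, here is a small sketch that runs it against two literal paths with jq -n, so no input file is needed. The variable names hit and miss are just illustrative.

```shell
# index/1 returns the position of the first match, or null when there is none.
hit=$(jq -nc '["tooldatareports",0,"tooldata",0,"data",0,"time"] | index("data")')
miss=$(jq -nc '["id","bioguide"] | index("data")')
echo "$hit $miss"   # prints: 4 null
```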
Knowing that, let's now run .[0] | index("data") on every event.
https://jqplay.org/s/ny0bV1xEED
null
null
null
null
null
null
null
null
null
null
null
4
4
4
4
4
4
4
4
4
4
4
null
4
4
4
4
4
4
4
4
4
4
4
null
null
null
null
As you can see, in our case the index is either 4 or null. We want to keep only the events whose index is not null, that is, the events that have "data" somewhere in their path.
(.[0] | index("data")) as $ix | select($ix) does just that. $ix is computed separately for each event, and select($ix) passes an event through only when its $ix is truthy. In jq only null and false are falsy, so an index of 0 would also pass.
For example, see https://jqplay.org/s/NwcD7_USZE. There, inputs | select(null) gives no output, but inputs | select(true) outputs every input.
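The filtering step can be tried in isolation. In this sketch two events are fed to jq as ordinary JSON values (no --stream needed, since they are already events); the two sample events are taken from the stream above, and the variable names are hypothetical.

```shell
# Two events from the stream: one without "data" in its path, one with it.
events='[["id","bioguide"],"E000295"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]'

# $ix is null for the first event, so select($ix) drops it;
# it is 4 for the second event, so that one passes through unchanged.
kept=$(printf '%s\n' "$events" | jq -c '(.[0] | index("data")) as $ix | select($ix)')
echo "$kept"
```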
This is the filtered stream:
https://jqplay.org/s/SgexvhtaGe
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
Before we go further, let's review update assignment (|=).
Have a look at https://jqplay.org/s/g4P6j8f9FG
Let's say we have the input [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"].
Then the filter .[0] |= .[4:] produces [["data",0,"time"],"2021-01-01"].
Why?
The right-hand side (.[4:]) is evaluated with the value selected by the left-hand side (.[0]) as its input, and the result is written back to that location. So in this case it replaces the path ["tooldatareports",0,"tooldata",0,"data",0,"time"] with ["data",0,"time"] and leaves the rest of the event untouched.
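The same update can be checked from the command line. A minimal sketch, using the event from the example above as a literal input (the variable names are illustrative):

```shell
event='[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]'

# |= rewrites only the element selected on the left (.[0], the path);
# the value in .[1] is carried over unchanged.
trimmed=$(printf '%s' "$event" | jq -c '.[0] |= .[4:]')
echo "$trimmed"   # prints: [["data",0,"time"],"2021-01-01"]
```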
Let's move on then.
So (.[0] | index("data")) as $ix | select($ix) | .[0] |= .[$ix:] has the output:
https://jqplay.org/s/AwcQpVyHO2
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],1]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],10]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],5]
[["data",2,"value"]]
[["data",2]]
[["data"]]
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],10]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],100]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],50]
[["data",2,"value"]]
[["data",2]]
[["data"]]
Now all we need to do is convert this stream back to JSON.
https://jqplay.org/s/j2uyzEU_Rc
[fromstream(inputs)] gives:
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
This is the output we wanted.
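Putting the steps together, the whole pipeline can be exercised end to end on a small input. This is a sketch under assumptions: the sample document is a shortened, made-up version of data.json with one record per tool, and it is piped in on stdin instead of being read from input.json.

```shell
# Shortened stand-in for data.json (hypothetical sample).
sample='{"id":{"lis":"S376"},"tooldatareports":[{"name":"A","tooldata":[{"toolid":1,"data":[{"time":"2021-01-01","value":1}]},{"toolid":2,"data":[{"time":"2021-01-01","value":10}]}]}]}'

# -n plus inputs lets the filter pull stream events one at a time.
# Events without "data" in their path are dropped, the remaining paths are
# cut so that they start at "data", and fromstream rebuilds one object
# each time a trimmed top-level closing event ([["data"]]) arrives.
result=$(printf '%s' "$sample" | jq -nc --stream '
  [fromstream(
    inputs
    | (.[0] | index("data")) as $ix
    | select($ix)
    | .[0] |= .[$ix:] )]')
echo "$result"
```

Note that the outer [...] still collects every rebuilt object into one array in memory; dropping the brackets would emit the objects one by one instead.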
Related
parsing jq returns null
I have a json output
{ "7": [ { "devices": [ "/dev/sde" ], "name": "osd-block-dcc9b386-529c-451e-9d84-8ccc4091102b", "tags": { "ceph.crush_device_class": "None", "ceph.db_device": "/dev/nvme0n1p5", "ceph.wal_device": "/dev/nvme0n1p6", }, "type": "block", "vg_name": "ceph-c4de9e90-853e-4569-b04f-8677ef9a8c7a" }, { "path": "/dev/nvme0n1p5", "tags": { "PARTUUID": "69712eb4-be52-4618-ba46-e317d6d3d76e" }, "type": "db" } ], "41": [ { "devices": [ "/dev/nvme1n1p13" ], "name": "osd-block-97bce07f-ae98-4fdb-83a9-9fa2f35cee60", "tags": { "ceph.crush_device_class": "None", }, "type": "block", "vg_name": "ceph-c1d48671-2a33-4615-95e3-cc1b18783f0c" } ], "9": [ { "devices": [ "/dev/sdf" ], "name": "osd-block-35323eb8-17c1-460d-8cc5-565f549e6991", "tags": { "ceph.crush_device_class": "None", "ceph.db_device": "/dev/nvme0n1p7", "ceph.wal_device": "/dev/nvme0n1p8", }, "type": "block", "vg_name": "ceph-9488e8b8-ec18-4860-93d3-6a1ad91c698c" }, { "path": "/dev/nvme0n1p7", "tags": { "PARTUUID": "ef0e9588-2a20-4c2c-8b62-d73945e01322" }, "type": "db" } ] }
Required output:
osd.7 /dev/sde /dev/nvme0n1p5 /dev/nvme0n1p6
osd.41 /dev/nvme1n1p13 n/a n/a
osd.9 /dev/sdf /dev/nvme0n1p7 /dev/nvme0n1p7
Problems:
When I try parsing using jq .[][].devices, I get null values:
$ cat json | jq .[][].devices
[ "/dev/sde" ]
null
[ "/dev/nvme1n1p13" ]
null
[ "/dev/sdf" ]
null
I can solve it via jq .[][].devices[]?. However, this trick doesn't help me when I do want to see where there's no value (to print n/a instead):
$ cat json | jq '.[][].tags | ."ceph.db_device"'
"/dev/nvme0n1p5"
null
"/dev/nvme0n1p3"
null
null
"/dev/nvme0n1p7"
null
And finally, I try to create a table:
$ cat json | jq -r '["osd."+keys[]], [.[][].devices[]?], [.[][].tags."ceph.db_device" // ""] | @csv' | column -t -s,
"osd.7" "osd.41" "osd.9"
"/dev/sde" "/dev/nvme0n1p13" "/dev/sdf"
"/dev/nvme0n1p5" "/dev/nvme0n1p7"
So the obvious problem is that the 3rd row doesn't match the correct values.
And the final problem is how do I transpose it from columns to rows, as detailed in the required output?
Would this do what you want?
jq --raw-output '
to_entries[]
| [ "osd." + .key,
    ( .value[0]
      | .devices[],
        ( .tags | ."ceph.db_device" // "n/a", ."ceph.wal_device" // "n/a" ) ) ]
| @tsv
'
osd.7 /dev/sde /dev/nvme0n1p5 /dev/nvme0n1p6
osd.41 /dev/nvme1n1p13 n/a n/a
osd.9 /dev/sdf /dev/nvme0n1p7 /dev/nvme0n1p8
Group and count JSON using jq [duplicate]
This question already has an answer here: How to group a JSON by a key and sort by its count? (1 answer)
I am trying to convert the following JSON into a csv which has each unique "name" and the total count (i.e: number of times that name appears).
Current data:
[ { "name": "test" }, { "name": "hello" }, { "name": "hello" } ]
Ideal output:
[ { "name": "hello", "count": 2 }, { "name": "test", "count": 1 } ]
I've tried [.[] | group_by (.name)[] ] but get the following error:
jq: error (at :11): Cannot index string with string "name"
JQ play link: https://jqplay.org/s/fWqNUii1b2
Note, I am already using jq to format the initial raw data into the format above. Please see the JQ play link here: https://jqplay.org/s/PwwRYscmBK
group_by(.name) | map({name: .[0].name, count: length})
[ { "name": "hello", "count": 2 }, { "name": "test", "count": 1 } ]
Based on OP's comment, use the following jq filter to count each name across multiple objects, where the .name is nested.
map(.labels) | map({name: .[0].name, count: length})
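The first filter can be checked directly against the data from the question. A minimal sketch, with the input passed inline via echo and an illustrative variable name:

```shell
# group_by(.name) sorts the items by name and groups equal names together;
# each group then becomes one {name, count} object.
counts=$(echo '[{"name":"test"},{"name":"hello"},{"name":"hello"}]' \
  | jq -c 'group_by(.name) | map({name: .[0].name, count: length})')
echo "$counts"   # prints: [{"name":"hello","count":2},{"name":"test","count":1}]
```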
echo '[{"name": "test"}, {"name": "hello"}, {"name": "hello"}]' | jq 'group_by (.name)[] | {name: .[0].name, count: length}' | jq -s .
[ { "name": "hello", "count": 2 }, { "name": "test", "count": 1 } ]
Use JQ to output JSON nested object into array, before conversion to CSV
Question is an extension of previous solution: Use JQ to parse JSON array of objects, using select to match specified key-value in the object element, then convert to CSV
Data Source:
{ "Other": [], "Objects": [ { "ObjectElementName": "Test 123", "ObjectElementArray": [], "ObjectNested": { "0": 20, "1": 10.5 }, "ObjectElementUnit": "1" }, { "ObjectElementName": "Test ABC 1", "ObjectElementArray": [], "ObjectNested": { "0": 0 }, "ObjectElementUnit": "2" }, { "ObjectElementName": "Test ABC 2", "ObjectElementArray": [], "ObjectNested": { "0": 15, "1": 20 }, "ObjectElementUnit": "5" } ], "Language": "en-US" }
JQ command to extract [FAILS]
jq -r '.Objects[] | select(.ObjectElementName | test("ABC")) | [.ObjectElementName,.ObjectNested,.ObjectElementUnit] | @csv' input.json
Output CSV required (or variation, so long as ObjectNested appears into a single column in CSV)
ObjectElementName,ObjectNested,ObjectElementUnit
"Test ABC 1","0:0","2"
"Test ABC 2","0:15,1:20","5"
With keys_unsorted and string interpolation, it's easy to turn ObjectNested into the form you desired:
.Objects[]
| select(.ObjectElementName | index("ABC"))
| [ .ObjectElementName,
    ([.ObjectNested | keys_unsorted[] as $k | "\($k):\(.[$k])"] | join(",")),
    .ObjectElementUnit ]
| @csv
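To see what the middle expression does, here is a sketch that applies the row-building part to a single object taken from the question's data. The select step is left out because only one object is fed in, and the variable name row is illustrative.

```shell
# Each key of ObjectNested becomes a "key:value" string; join(",") merges
# them into one field, which @csv then quotes as a single column.
row=$(echo '{"ObjectElementName":"Test ABC 2","ObjectNested":{"0":15,"1":20},"ObjectElementUnit":"5"}' \
  | jq -r '[ .ObjectElementName,
             ([.ObjectNested | keys_unsorted[] as $k | "\($k):\(.[$k])"] | join(",")),
             .ObjectElementUnit ] | @csv')
echo "$row"   # prints: "Test ABC 2","0:15,1:20","5"
```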
jq: count nest object values based on group by
Json:
[ { "account": "1", "cost": [ { "usage":"low", "totalcost": "2.01" } ] }, { "account": "2", "cost": [ { "usage":"low", "totalcost": "2.25" } ] }, { "account": "1", "cost": [ { "usage":"low", "totalcost": "15" } ] }, { "anotheraccount": "a", "cost": [ { "usage":"low", "totalcost": "2" } ] } ]
Results expected:
account cost
1 17.01
2 2.25
anotheraccount cost
a 2
I am able to pull out data but not sure how to aggregate it.
jq '.[] | {account,cost : .cost[].totalcost}'
Is there a way to do this using jq, so I get all types of accounts and costs associated with them?
Two helper functions will help get you to your destination:
def sigma( f ): reduce .[] as $o (null; . + ($o | f )) ;
def group( keyname ):
  map(select(has(keyname)))
  | group_by( .[keyname] )
  | map({(keyname) : .[0][keyname], cost: sigma(.cost[].totalcost | tonumber) }) ;
With these, the following invocations:
group("account"), group("anotheraccount")
yield:
[{"account":"1","cost":17.009999999999998},{"account":"2","cost":2.25}]
[{"anotheraccount":"a","cost":2}]
You should be able to manage the final formatting step in jq.
jq get the value of x based on y in a complex json file
jq strikes again. Trying to get the value of DATABASES_DEFAULT based on the name in a json file that has a whole lot of names and I'm completely lost.
My file looks like the following (output of an aws ecs describe-task-definition) only much more complex; I've stripped this to the most basic example I can where the structure is still intact.
{ "taskDefinition": { "status": "bar", "family": "bar2", "volumes": [], "taskDefinitionArn": "bar3", "containerDefinitions": [ { "dnsSearchDomains": [], "environment": [ { "name": "bar4", "value": "bar5" }, { "name": "bar6", "value": "bar7" }, { "name": "DATABASES_DEFAULT", "value": "foo" } ], "name": "baz", "links": [] }, { "dnsSearchDomains": [], "environment": [ { "name": "bar4", "value": "bar5" }, { "name": "bar6", "value": "bar7" }, { "name": "DATABASES_DEFAULT", "value": "foo2" } ], "name": "boo", "links": [] } ], "revision": 1 } }
I need the value of DATABASES_DEFAULT where the name is baz. Note that there are a lot of keypairs with name, I'm specifically talking about the one outside of environment.
I've been tinkering with this but only got this far before realizing that I don't understand how to access nested values.
jq '.[] | select(.name==DATABASES_DEFAULT) | .value'
which is returning
jq: error: DATABASES_DEFAULT/0 is not defined at <top-level>, line 1:
.[] | select(.name==DATABASES_DEFAULT) | .value
jq: 1 compile error
Obviously this a) doesn't work, and b) even if it did, it's independent of the name value. My thought was to return all the db defaults and then identify the one with baz, but I don't know if that's the right approach.
I like to think of it as digging down into the structure, so first you open the outer layers:
.taskDefinition.containerDefinitions[]
Now select the one you want:
select(.name =="baz")
Open the inner structure:
.environment[]
Select the desired object:
select(.name == "DATABASES_DEFAULT")
Choose the key you want:
.value
Taken together:
parse.jq
.taskDefinition.containerDefinitions[] | select(.name =="baz") | .environment[] | select(.name == "DATABASES_DEFAULT") | .value
Run it like this:
<infile jq -f parse.jq
Output:
"foo"
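The digging-down steps can also be run as a single inline filter. In this sketch the document is a trimmed, made-up version of the question's file that keeps only the keys the filter touches, and -r is added so the value prints without quotes.

```shell
# Trimmed stand-in for the task definition (hypothetical sample).
doc='{"taskDefinition":{"containerDefinitions":[
  {"name":"baz","environment":[{"name":"bar4","value":"bar5"},{"name":"DATABASES_DEFAULT","value":"foo"}]},
  {"name":"boo","environment":[{"name":"DATABASES_DEFAULT","value":"foo2"}]}]}}'

# Pick the container named "baz" first, then the environment entry
# named "DATABASES_DEFAULT" inside it, then read its value.
value=$(printf '%s' "$doc" | jq -r '
  .taskDefinition.containerDefinitions[]
  | select(.name == "baz")
  | .environment[]
  | select(.name == "DATABASES_DEFAULT")
  | .value')
echo "$value"   # prints: foo
```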
The following seems to work:
.taskDefinition.containerDefinitions[] | select( select( .environment[] | .name == "DATABASES_DEFAULT" ).name == "baz" )
The output is the object with the name key mapped to "baz".
$ jq '.taskDefinition.containerDefinitions[] | select(select(.environment[]|.name == "DATABASES_DEFAULT").name=="baz")' tmp.json
{ "dnsSearchDomains": [], "environment": [ { "name": "bar4", "value": "bar5" }, { "name": "bar6", "value": "bar7" }, { "name": "DATABASES_DEFAULT", "value": "foo" } ], "name": "baz", "links": [] }